1. Introduction
Bank direct marketing is an interactive process of building beneficial relationships among stakeholders. Effective multichannel communication involves the study of customer characteristics and behavior. Apart from profit growth, which may raise customer loyalty and positive responses, the goal of bank direct marketing is to increase the response rates of direct promotion campaigns.
The use of data visualization by decision makers and their organizations offers many benefits, including the ability to absorb information in new and constructive ways. Visualizing relationships and patterns between operational and business activities can help identify and act on emerging trends. Visualization also enables users to manipulate and interact with data directly and fosters a new business language to tell the most relevant story. The choice of a proper visualization technique depends on many factors, such as the type of data (numerical or categorical), the nature of the domain of interest, and the final visualization purpose, which may involve plotting the distribution of data points or comparing different attributes over the same data point.
2. Goals & Objective
Goal
Produce an exploratory data analysis that leads from the data to a derived strategy: identifying the customers who are more likely to subscribe, so that the company can focus on those most likely to generate a sale.
Objective
Create the exploratory data analysis using R.
3. Methodology
First, we import the data and run a quick check on the data set we are using.
Next, a preprocessing phase is implemented to balance the data distribution.
After that, we transform the data as required by the related business question.
4. Exploratory Data Analysis
Read Library
# Packages inferred from the attach messages and warnings in the original output
library(flexdashboard)
library(tidyverse)
library(plotly)
library(glue)
library(ggmosaic)
library(gmodels)
Read Data
bank_data = read.csv(file = "bank-additional-full.csv",
                     sep = ";",
                     stringsAsFactors = F)
# Recode the target y from "yes"/"no" to a 0/1 factor
bank_data = bank_data %>%
  mutate(y = factor(if_else(y == "yes", "1", "0"),
                    levels = c("0", "1")))
bank_data$job <- as.factor(bank_data$job)
CrossTable(bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | 0 | 1 |
## |-----------|-----------|
## | 36548 | 4640 |
## | 0.887 | 0.113 |
## |-----------|-----------|
##
##
##
##
This is an unbalanced two-level categorical variable: 88.7% of the values are “no” (recoded as “0”) and only 11.3% are “yes” (recoded as “1”). The recoding is applied because it is more natural to work with a 0/1 dependent variable.
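The class shares quoted above can be recomputed from the two counts in the table. The following is a minimal sketch in base R; the names `y_counts`, `y_share` and `imbalance_ratio` are illustrative and not part of the original analysis.

```r
# Counts of y copied from the CrossTable output above
y_counts <- c("0" = 36548, "1" = 4640)

# Share of each class, rounded to three decimals as in the table
y_share <- round(y_counts / sum(y_counts), 3)
print(y_share)

# Imbalance ratio: number of "no" answers per "yes" answer
imbalance_ratio <- y_counts[["0"]] / y_counts[["1"]]
print(round(imbalance_ratio, 1))
```

A ratio this far from 1 is a reminder that raw counts per category mostly reflect the "no" class, which is why the later sections compare subscription rates rather than counts.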
Finding out which variables have the most missing (“unknown”) values
bank_data %>%
  summarise_all(list(~sum(. == "unknown"))) %>%
  gather(key = "variable", value = "nr_unknown") %>%
  arrange(-nr_unknown)
## variable nr_unknown
## 1 default 8597
## 2 education 1731
## 3 housing 990
## 4 loan 990
## 5 job 330
## 6 marital 80
## 7 age 0
## 8 contact 0
## 9 month 0
## 10 day_of_week 0
## 11 duration 0
## 12 campaign 0
## 13 pdays 0
## 14 previous 0
## 15 poutcome 0
## 16 emp.var.rate 0
## 17 cons.price.idx 0
## 18 cons.conf.idx 0
## 19 euribor3m 0
## 20 nr.employed 0
## 21 y 0
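Because missing values in this data set are stored as the literal string "unknown" rather than as `NA`, a check based on `is.na()` would report no missing values at all. The sketch below shows the same per-column count in base R on a small made-up data frame; `toy` and its values are assumptions for illustration only.

```r
# Toy data frame standing in for bank_data (values are made up)
toy <- data.frame(
  job     = c("admin.", "unknown", "student", "retired"),
  default = c("unknown", "no", "unknown", "no"),
  age     = c(41, 35, 22, 67),
  stringsAsFactors = FALSE
)

# For each column, count the entries equal to the string "unknown"
nr_unknown <- sapply(toy, function(col) sum(col == "unknown"))

# Most affected variables first
nr_unknown <- sort(nr_unknown, decreasing = TRUE)
print(nr_unknown)

# is.na() finds nothing, because "unknown" is an ordinary string
print(sum(is.na(toy)))
```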
#Function
# default theme for ggplot
theme_set(theme_bw())
#Theme Algoritma
theme_algoritma <- theme(legend.key = element_rect(fill="black"),
legend.background = element_rect(color="white", fill="#263238"),
plot.subtitle = element_text(size=6, color="white"),
panel.background = element_rect(fill="#dddddd"),
panel.border = element_rect(fill=NA),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(color="darkgrey", linetype=2),
panel.grid.minor.y = element_blank(),
plot.background = element_rect(fill="#263238"),
text = element_text(color="white"),
axis.text = element_text(color="white")
)
# setting default parameters for mosaic plots
mosaic_theme = theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# setting default parameters for crosstables
fun_crosstable = function(df, var1, var2){
  # df: dataframe containing both columns to cross
  # var1, var2: columns to cross together.
  CrossTable(df[, var1], df[, var2],
             prop.r = T,
             prop.c = F,
             prop.t = F,
             prop.chisq = F,
             dnn = c(var1, var2))
}
a. Age
Customer Age Profile
Plot Analysis by Age
bank_data_age = bank_data %>%
  mutate(age = if_else(age > 60, "high", if_else(age > 30, "mid", "low")))
bank_data_age = bank_data_age %>%
  filter(age != "unknown")
plot_age <- ggplot(bank_data_age) +
  geom_mosaic(aes(x = product(y, age), fill = y)) +
  xlab("age") +
  ylab(NULL)
ggplotly(plot_age)
45.5% of customers over 60 years old subscribed to a term deposit, far more than younger customers: 15.2% of young adults (aged 30 or under) and only 9.4% of the remaining customers (aged 31 to 60).
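The nested `if_else()` binning above can also be written with base R's `cut()`, which makes the interval boundaries explicit. This is a sketch of an alternative, not the approach used in the report; the example ages are made up and chosen to include the boundary values.

```r
# Example ages, including the boundary values 30 and 60
age <- c(25, 30, 31, 60, 61)

# Right-closed intervals: (-Inf, 30], (30, 60], (60, Inf)
age_group <- cut(age,
                 breaks = c(-Inf, 30, 60, Inf),
                 labels = c("low", "mid", "high"))
print(as.character(age_group))
```

With the default right-closed intervals, an age of exactly 30 falls in "low" and an age of exactly 60 falls in "mid", which matches the `age > 30` and `age > 60` conditions in the code above.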
b. Job
Customer Job Distribution Profile
# Count customers per job
bankjob <- count(bank_data, bank_data$job)
# Count "no" (0) and "yes" (1) outcomes per job
bank_job_yesno <- count(bank_data, bank_data$job, bank_data$y)
# Convert long to wide: one row per job, one column per outcome
bank_job_wide <- pivot_wider(data = bank_job_yesno, names_from = `bank_data$y`, values_from = n)
# Attach the total number of customers per job
bank_job_wide <- bank_job_wide %>%
  mutate(freq = bankjob$n)
names(bank_job_wide)[1] <- "jobs"
names(bank_job_wide)[4] <- "frequency"
# Plot the distribution of customers by job, most frequent first
bank_job_wide$jobs <- reorder(bank_job_wide$jobs, -bank_job_wide$frequency)
plotjobs <- ggplot(bank_job_wide, aes(jobs, frequency)) +
  geom_col(aes(fill = jobs)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(plotjobs)
Plot Analysis by job
### Distribution of Customer
names(bank_job_wide)[2] <- "no"
names(bank_job_wide)[3] <- "yes"
bank_job_wide <- bank_job_wide %>%
mutate(total = no + yes) %>%
mutate(percentage_no = no / total * 100) %>%
mutate(percentage_yes = yes / total * 100)
# bank_job_wide$total <- bank_job_wide$no + bank_job_wide$yes
# bank_job_wide$percentage_no <- bank_job_wide$no/bank_job_wide$total*100
# bank_job_wide$percentage_yes <- bank_job_wide$yes/bank_job_wide$total*100
plotjobs2<- ggplot(bank_job_wide, aes(`jobs`, percentage_yes)) +
geom_point(aes(color = jobs, size = frequency)) +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
labs( title = "Distribution of customer by jobs",
x = "Jobs",
y = "yes Percentage",
color = "jobs", size = "frequency") +
theme(text = element_text(face = "bold"))
ggplotly(plotjobs2)
plot_job_2 <- bank_data %>%
  ggplot() +
  geom_mosaic(aes(x = product(y, job), fill = y)) +
  mosaic_theme
ggplotly(plot_job_2)
Even though admin and blue-collar customers received the most calls, those most likely to subscribe are students and retired people.
Surprisingly, the student (31.4%), retired (25.2%) and unemployed (14.2%) categories show the best relative frequencies of term deposit subscription.
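The count, pivot and percentage steps used for the job variable are repeated almost unchanged for the other categorical variables in this report. A small helper can express the pattern once. The sketch below uses base R only; the function name `rate_by`, its arguments and the toy data are hypothetical and not part of the original code.

```r
# Hypothetical helper: group size and subscription rate (in %) per category
rate_by <- function(df, var, target = "y") {
  # Cross-tabulate the category against the 0/1 target
  tab <- table(df[[var]], df[[target]])
  data.frame(
    group          = rownames(tab),
    frequency      = as.vector(rowSums(tab)),
    percentage_yes = as.vector(tab[, "1"] / rowSums(tab) * 100),
    stringsAsFactors = FALSE
  )
}

# Toy data (made up) standing in for bank_data
toy <- data.frame(
  job = c("student", "student", "admin.", "admin.", "admin.", "admin."),
  y   = factor(c("1", "0", "0", "0", "1", "0"), levels = c("0", "1"))
)

job_rates <- rate_by(toy, "job")
print(job_rates)
```

In this toy example the smaller group has the higher rate, which mirrors the pattern described above: the categories contacted most often are not necessarily the ones that subscribe most often.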
c. Marital
Customer Marital Distribution Profile
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | y
## marital | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## divorced | 4136 | 476 | 4612 |
## | 0.897 | 0.103 | 0.112 |
## -------------|-----------|-----------|-----------|
## married | 22396 | 2532 | 24928 |
## | 0.898 | 0.102 | 0.605 |
## -------------|-----------|-----------|-----------|
## single | 9948 | 1620 | 11568 |
## | 0.860 | 0.140 | 0.281 |
## -------------|-----------|-----------|-----------|
## unknown | 68 | 12 | 80 |
## | 0.850 | 0.150 | 0.002 |
## -------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## -------------|-----------|-----------|-----------|
##
##
Customer Marital Profile
# Count customers per marital status
bank_marital <- count(bank_data, bank_data$marital)
# Count "no" (0) and "yes" (1) outcomes per marital status
bank_marital_yesno <- count(bank_data, bank_data$marital, bank_data$y)
# Convert long to wide: one row per marital status, one column per outcome
bank_marital_wide <- pivot_wider(data = bank_marital_yesno, names_from = `bank_data$y`, values_from = n)
# Attach the total number of customers per marital status
bank_marital_wide <- bank_marital_wide %>%
  mutate(freq = bank_marital$n)
names(bank_marital_wide)[1] <- "marital"
names(bank_marital_wide)[4] <- "frequency"
# Plot the distribution of customers by marital status
plotmarital <- ggplot(bank_marital_wide, aes(marital, frequency)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
plotmaritala <- bank_data %>%
ggplot() +
aes(x = marital, y = ..count../nrow(bank_data), fill = y) +
geom_bar() +
ylab("relative frequency")
ggplotly(plotmaritala)
Plot Analysis by Marital
plotmarital3 <- bank_data %>%
  ggplot() +
  geom_mosaic(aes(x = product(y, marital), fill = y)) +
  mosaic_theme +
  xlab("Marital status") +
  ylab(NULL)
ggplotly(plotmarital3, tooltip = "text")
### Distribution of Customer
names(bank_marital_wide)[2] <- "no"
names(bank_marital_wide)[3] <- "yes"
bank_marital_wide <- bank_marital_wide %>%
mutate(total = no + yes) %>%
mutate(percentage_no = no / total * 100) %>%
mutate(percentage_yes = yes / total * 100)
bank_marital_long <- bank_marital_wide %>%
pivot_longer(cols = c(percentage_yes, percentage_no),
names_to = "percentage",
values_to = "value" ) %>%
mutate(text = glue(
"Marital status : {marital}
Total : {frequency}
percentage : {round(value,2)}%"))
plotmarital3 <- ggplot(bank_marital_long, aes(x = marital,
y = value,
text = text)) +
geom_col(aes(fill = percentage), position = "dodge") +
coord_flip() +
labs(x = NULL,
y = NULL,
title = "Marital Distribution of Customer") +
theme(legend.position = "none") +
theme_algoritma
ggplotly(plotmarital3, tooltip = "text")
# plotmarital3 <- ggplot(bank_marital_long, aes(x = marital,
# y = value)) +
# geom_point(aes(color = marital, size = frequency, position = "dodge")) +
# coord_flip() +
# labs(x = NULL,
# y = NULL,
# title = "Marital Distribution of Customer") +
# theme(legend.position = "none") +
# theme_algoritma
#
# ggplotly(plotmarital3)
plotmarital2 <- ggplot(bank_marital_wide, aes(marital, percentage_yes)) +
  geom_point(aes(color = marital, size = frequency)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Distribution of customer by marital",
       x = "Marital",
       y = "yes Percentage",
       color = "marital", size = "frequency") +
  theme(text = element_text(face = "bold"))
ggplotly(plotmarital2)
d. Education
Customer Education Distribution Profile
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 41108
##
##
## | y
## education | 0 | 1 | Row Total |
## --------------------|-----------|-----------|-----------|
## basic.4y | 3743 | 427 | 4170 |
## | 0.898 | 0.102 | 0.101 |
## --------------------|-----------|-----------|-----------|
## basic.6y | 2098 | 188 | 2286 |
## | 0.918 | 0.082 | 0.056 |
## --------------------|-----------|-----------|-----------|
## basic.9y | 5566 | 471 | 6037 |
## | 0.922 | 0.078 | 0.147 |
## --------------------|-----------|-----------|-----------|
## high.school | 8471 | 1030 | 9501 |
## | 0.892 | 0.108 | 0.231 |
## --------------------|-----------|-----------|-----------|
## illiterate | 14 | 4 | 18 |
## | 0.778 | 0.222 | 0.000 |
## --------------------|-----------|-----------|-----------|
## professional.course | 4642 | 595 | 5237 |
## | 0.886 | 0.114 | 0.127 |
## --------------------|-----------|-----------|-----------|
## university.degree | 10473 | 1664 | 12137 |
## | 0.863 | 0.137 | 0.295 |
## --------------------|-----------|-----------|-----------|
## unknown | 1473 | 249 | 1722 |
## | 0.855 | 0.145 | 0.042 |
## --------------------|-----------|-----------|-----------|
## Column Total | 36480 | 4628 | 41108 |
## --------------------|-----------|-----------|-----------|
##
##
bank_data = bank_data %>%
filter(education != "illiterate")
bank_data = bank_data %>%
mutate(education = recode(education, "unknown" = "university.degree"))
# Count customers per education level
bank_education <- count(bank_data, bank_data$education)
# Count "no" (0) and "yes" (1) outcomes per education level
bank_education_yesno <- count(bank_data, bank_data$education, bank_data$y)
# Convert long to wide: one row per education level, one column per outcome
bank_education_wide <- pivot_wider(data = bank_education_yesno, names_from = `bank_data$y`, values_from = n)
# Attach the total number of customers per education level
bank_education_wide <- bank_education_wide %>%
  mutate(freq = bank_education$n)
names(bank_education_wide)[1] <- "education"
names(bank_education_wide)[4] <- "frequency"
# Plot the distribution of customers by education level
ploteducation <- ggplot(bank_education_wide, aes(education, frequency)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(ploteducation)
Plot Analysis by Education
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, education), fill = y)) +
mosaic_theme +
xlab("Education") +
  ylab(NULL)
### Distribution of Customer
names(bank_education_wide)[2] <- "no"
names(bank_education_wide)[3] <- "yes"
bank_education_wide <- bank_education_wide %>%
mutate(total = no + yes) %>%
mutate(percentage_no = no / total * 100) %>%
mutate(percentage_yes = yes / total * 100)
bank_education_long <- pivot_longer(data = bank_education_wide,
cols = c(percentage_yes, percentage_no),
names_to = "percentage",
values_to = "value" )
ploteducation3 <- ggplot(bank_education_long, aes(x = education,
y = value)) +
geom_col(aes(fill = percentage), position = "fill") +
coord_flip() +
labs(x = NULL,
y = NULL,
title = "Education Distribution of Customer") +
theme(legend.position = "none") +
theme_algoritma
ggplotly(ploteducation3)
ploteducation2 <- ggplot(bank_education_wide, aes(education, percentage_yes)) +
  geom_point(aes(color = education, size = frequency)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Distribution of customer by education",
       x = "Education",
       y = "yes Percentage",
       color = "education", size = "frequency") +
  theme(text = element_text(face = "bold"))
ggplotly(ploteducation2)
The cross table suggests a positive association between education level and the probability of subscription: the rate rises from 7.8% for basic.9y to 13.7% for university.degree, although basic.4y (10.2%) does not follow the trend.
e. Default
Does the client have a credit in default?
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 41090
##
##
## | y
## default | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## no | 28326 | 4182 | 32508 |
## | 0.871 | 0.129 | 0.791 |
## -------------|-----------|-----------|-----------|
## unknown | 8137 | 442 | 8579 |
## | 0.948 | 0.052 | 0.209 |
## -------------|-----------|-----------|-----------|
## yes | 3 | 0 | 3 |
## | 1.000 | 0.000 | 0.000 |
## -------------|-----------|-----------|-----------|
## Column Total | 36466 | 4624 | 41090 |
## -------------|-----------|-----------|-----------|
##
##
This feature is not usable in practice: only 3 customers answered “yes”, and 20.9% of the values are “unknown”.
f. Housing
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 41090
##
##
## | y
## housing | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## no | 16552 | 2018 | 18570 |
## | 0.891 | 0.109 | 0.452 |
## -------------|-----------|-----------|-----------|
## unknown | 882 | 107 | 989 |
## | 0.892 | 0.108 | 0.024 |
## -------------|-----------|-----------|-----------|
## yes | 19032 | 2499 | 21531 |
## | 0.884 | 0.116 | 0.524 |
## -------------|-----------|-----------|-----------|
## Column Total | 36466 | 4624 | 41090 |
## -------------|-----------|-----------|-----------|
##
##
##
## Pearson's Chi-squared test
##
## data: bank_data$housing and bank_data$y
## X-squared = 5.6515, df = 2, p-value = 0.05926
Since the p-value (0.059) is above 0.05, we cannot reject the null hypothesis of independence at the 5% significance level: the data show no statistically significant association between the dependent variable y and the housing feature.
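The test can be reproduced from the counts in the cross table alone, without the original data file. The sketch below rebuilds the table as a matrix and runs base R's `chisq.test()`; the object names `housing_tab`, `housing_test` and `reject_independence` are illustrative.

```r
# Housing-by-y counts copied from the cross table above
housing_tab <- matrix(c(16552, 2018,
                          882,  107,
                        19032, 2499),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(housing = c("no", "unknown", "yes"),
                                      y = c("0", "1")))

# Pearson chi-squared test of independence between housing and y
housing_test <- chisq.test(housing_tab)
print(housing_test)

# Decision at the 5% significance level
reject_independence <- housing_test$p.value < 0.05
print(reject_independence)
```

Failing to reject independence is not proof that housing has no effect. It only means that this sample does not provide enough evidence of an association at the chosen level.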
g. Contact
How was the client contacted?
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 41090
##
##
## | y
## contact | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## cellular | 22237 | 3839 | 26076 |
## | 0.853 | 0.147 | 0.635 |
## -------------|-----------|-----------|-----------|
## telephone | 14229 | 785 | 15014 |
## | 0.948 | 0.052 | 0.365 |
## -------------|-----------|-----------|-----------|
## Column Total | 36466 | 4624 | 41090 |
## -------------|-----------|-----------|-----------|
##
##
This feature is really interesting: 14.7% of the customers contacted by cellular subscribed to a term deposit, while only 5.2% of those contacted by telephone did.
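To check whether a gap of this size could plausibly be due to chance, the two subscription rates can be compared with a two-sample test of proportions. The sketch below uses base R's `prop.test()` on the counts from the table above; it is an illustrative addition and was not part of the original analysis.

```r
# Subscribers and contacts per channel, copied from the cross table above
subscribed <- c(cellular = 3839, telephone = 785)
contacted  <- c(cellular = 26076, telephone = 15014)

# Subscription rate per channel
rates <- subscribed / contacted
print(round(rates, 3))

# Two-sample test for equality of proportions
contact_test <- prop.test(subscribed, contacted)
print(contact_test$p.value)
```

A very small p-value indicates that the difference between channels is unlikely to be a sampling artifact, although it says nothing about why the channels differ.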
h. Month
Customer Month Distribution Profile
month_recode = c("jan" = "(01)jan",
"feb" = "(02)feb",
"mar" = "(03)mar",
"apr" = "(04)apr",
"may" = "(05)may",
"jun" = "(06)jun",
"jul" = "(07)jul",
"aug" = "(08)aug",
"sep" = "(09)sep",
"oct" = "(10)oct",
"nov" = "(11)nov",
"dec" = "(12)dec")
bank_data = bank_data %>%
  mutate(month = recode(month, !!!month_recode))
fun_crosstable(bank_data, "month", "y")
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 41090
##
##
## | y
## month | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## (03)mar | 269 | 274 | 543 |
## | 0.495 | 0.505 | 0.013 |
## -------------|-----------|-----------|-----------|
## (04)apr | 2089 | 538 | 2627 |
## | 0.795 | 0.205 | 0.064 |
## -------------|-----------|-----------|-----------|
## (05)may | 12849 | 884 | 13733 |
## | 0.936 | 0.064 | 0.334 |
## -------------|-----------|-----------|-----------|
## (06)jun | 4748 | 558 | 5306 |
## | 0.895 | 0.105 | 0.129 |
## -------------|-----------|-----------|-----------|
## (07)jul | 6513 | 647 | 7160 |
## | 0.910 | 0.090 | 0.174 |
## -------------|-----------|-----------|-----------|
## (08)aug | 5514 | 649 | 6163 |
## | 0.895 | 0.105 | 0.150 |
## -------------|-----------|-----------|-----------|
## (09)sep | 314 | 256 | 570 |
## | 0.551 | 0.449 | 0.014 |
## -------------|-----------|-----------|-----------|
## (10)oct | 401 | 314 | 715 |
## | 0.561 | 0.439 | 0.017 |
## -------------|-----------|-----------|-----------|
## (11)nov | 3676 | 415 | 4091 |
## | 0.899 | 0.101 | 0.100 |
## -------------|-----------|-----------|-----------|
## (12)dec | 93 | 89 | 182 |
## | 0.511 | 0.489 | 0.004 |
## -------------|-----------|-----------|-----------|
## Column Total | 36466 | 4624 | 41090 |
## -------------|-----------|-----------|-----------|
##
##
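The numeric prefixes such as "(03)mar" are there so that months sort in calendar order rather than alphabetically. The same ordering can be obtained without changing the labels by turning the column into a factor with explicit levels. This is a sketch of an alternative, assuming the lowercase three-letter month codes used in the data set; the example values are made up.

```r
# Example month values in the data set's lowercase three-letter format
month <- c("nov", "mar", "may", "dec", "apr")

# month.abb is R's built-in vector "Jan", "Feb", ..., "Dec"
month_f <- factor(month, levels = tolower(month.abb))

# Sorting (and the order of tables and plot axes) now follows the calendar
print(as.character(sort(month_f)))
```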
plotmonth <- bank_data %>%
  ggplot() +
  aes(x = month, y = ..count../nrow(bank_data), fill = y) +
  geom_bar() +
  ylab("relative frequency")
ggplotly(plotmonth)
# Count customers per month
bank_month <- count(bank_data, bank_data$month)
# Count "no" (0) and "yes" (1) outcomes per month
bank_month_yesno <- count(bank_data, bank_data$month, bank_data$y)
# Convert long to wide: one row per month, one column per outcome
bank_month_wide <- pivot_wider(data = bank_month_yesno, names_from = `bank_data$y`, values_from = n)
# Attach the total number of customers per month
bank_month_wide <- bank_month_wide %>%
  mutate(freq = bank_month$n)
names(bank_month_wide)[1] <- "month"
names(bank_month_wide)[4] <- "frequency"
# Plot the distribution of customers by month
plotmonth <- ggplot(bank_month_wide, aes(month, frequency)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Plot Analysis by month
### Distribution of Customer
names(bank_month_wide)[2] <- "no"
names(bank_month_wide)[3] <- "yes"
bank_month_wide <- bank_month_wide %>%
mutate(total = no + yes) %>%
mutate(percentage_no = no / total * 100) %>%
mutate(percentage_yes = yes / total * 100)
bank_month_long <- pivot_longer(data = bank_month_wide,
cols = c(percentage_yes, percentage_no),
names_to = "percentage",
values_to = "value" )
plotmonth3 <- ggplot(bank_month_long, aes(x = month,
y = value)) +
geom_col(aes(fill = percentage), position = "fill") +
coord_flip() +
labs(x = NULL,
y = NULL,
title = "Month Distribution of Customer") +
theme(legend.position = "none") +
theme_algoritma
ggplotly(plotmonth3)
plotmonth2 <- ggplot(bank_month_wide, aes(month, percentage_yes)) +
  geom_point(aes(color = month, size = frequency)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Distribution of customer by month",
       x = "Month",
       y = "yes Percentage",
       color = "month", size = "frequency") +
  theme(text = element_text(face = "bold"))
ggplotly(plotmonth2)
i. Day of the Week
Customer Day of the week distribution profile
day_recode = c("mon" = "(01)mon",
"tue" = "(02)tue",
"wed" = "(03)wed",
"thu" = "(04)thu",
"fri" = "(05)fri")
bank_data = bank_data %>%
  mutate(day_of_week = recode(day_of_week, !!!day_recode))
fun_crosstable(bank_data, "day_of_week", "y")
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 41090
##
##
## | y
## day_of_week | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## (01)mon | 7650 | 844 | 8494 |
## | 0.901 | 0.099 | 0.207 |
## -------------|-----------|-----------|-----------|
## (02)tue | 7122 | 952 | 8074 |
## | 0.882 | 0.118 | 0.196 |
## -------------|-----------|-----------|-----------|
## (03)wed | 7174 | 944 | 8118 |
## | 0.884 | 0.116 | 0.198 |
## -------------|-----------|-----------|-----------|
## (04)thu | 7553 | 1040 | 8593 |
## | 0.879 | 0.121 | 0.209 |
## -------------|-----------|-----------|-----------|
## (05)fri | 6967 | 844 | 7811 |
## | 0.892 | 0.108 | 0.190 |
## -------------|-----------|-----------|-----------|
## Column Total | 36466 | 4624 | 41090 |
## -------------|-----------|-----------|-----------|
##
##
Calls aren’t made during weekend days. Although calls are evenly distributed across the weekdays, Thursdays tend to show better results (12.1% of the calls made that day led to a subscription), unlike Mondays with only 9.9% of successful calls. However, those differences are small, which makes this feature not that important. It would have been interesting to see how customers respond to weekend calls.
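The weekday rates can be recomputed from the counts in the cross table above, which also makes the size of the spread explicit. A minimal sketch in base R; the vector names are illustrative.

```r
# Subscribers and contacts per weekday, copied from the cross table above
subscribed <- c(mon = 844, tue = 952, wed = 944, thu = 1040, fri = 844)
contacted  <- c(mon = 8494, tue = 8074, wed = 8118, thu = 8593, fri = 7811)

# Subscription rate per weekday, in percent
rate_pct <- round(subscribed / contacted * 100, 1)
print(rate_pct)

# Best and worst weekday, and the gap between them in percentage points
best  <- names(which.max(rate_pct))
worst <- names(which.min(rate_pct))
spread <- max(rate_pct) - min(rate_pct)
print(c(best = best, worst = worst))
print(spread)
```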
Plot Analysis by Day of Week
plotdowa <- bank_data %>%
  ggplot() +
  aes(x = day_of_week, y = ..count../nrow(bank_data), fill = y) +
  geom_bar() +
  ylab("relative frequency")
ggplotly(plotdowa)
# Count customers per day of the week
bank_dow <- count(bank_data, bank_data$day_of_week)
# Count "no" (0) and "yes" (1) outcomes per day of the week
bank_dow_yesno <- count(bank_data, bank_data$day_of_week, bank_data$y)
# Convert long to wide: one row per day of the week, one column per outcome
bank_dow_wide <- pivot_wider(data = bank_dow_yesno, names_from = `bank_data$y`, values_from = n)
# Attach the total number of customers per day of the week
bank_dow_wide <- bank_dow_wide %>%
  mutate(freq = bank_dow$n)
names(bank_dow_wide)[1] <- "day_of_week"
names(bank_dow_wide)[4] <- "frequency"
# Plot the distribution of customers by day of the week
plotdow <- ggplot(bank_dow_wide, aes(day_of_week, frequency)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
### Distribution of Customer
names(bank_dow_wide)[2] <- "no"
names(bank_dow_wide)[3] <- "yes"
bank_dow_wide <- bank_dow_wide %>%
mutate(total = no + yes) %>%
mutate(percentage_no = no / total * 100) %>%
mutate(percentage_yes = yes / total * 100)
bank_dow_long <- pivot_longer(data = bank_dow_wide,
cols = c(percentage_yes, percentage_no),
names_to = "percentage",
values_to = "value" )
plotdow3 <- ggplot(bank_dow_long, aes(x = day_of_week,
y = value)) +
geom_col(aes(fill = percentage), position = "fill") +
coord_flip() +
labs(x = NULL,
y = NULL,
title = "Day of week Distribution of Customer") +
theme(legend.position = "none") +
theme_algoritma
ggplotly(plotdow3)
plotdow2 <- ggplot(bank_dow_wide, aes(day_of_week, percentage_yes)) +
  geom_point(aes(color = day_of_week, size = frequency)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Distribution of customer by day of week",
       x = "Day of week",
       y = "yes Percentage",
       color = "day_of_week", size = "frequency") +
  theme(text = element_text(face = "bold"))
ggplotly(plotdow2)
5. Conclusion
This exploratory analysis has surfaced and visualized several patterns in the data that can support a strategy for identifying the customers who are more likely to subscribe, allowing greater focus on those most likely to generate a sale.
age : 45.5% of customers over 60 years old subscribed to a term deposit, far more than younger customers: 15.2% of young adults (aged 30 or under) and only 9.4% of the remaining customers (aged 31 to 60).
jobs : Even though admin and blue-collar customers received the most calls, those most likely to subscribe are students and retired people.
Surprisingly, the student (31.4%), retired (25.2%) and unemployed (14.2%) categories show the best relative frequencies of term deposit subscription.
marital : Single customers subscribe to term deposits slightly more often (14.0%) than divorced (10.3%) and married (10.2%) customers.
education : There appears to be a positive association between the number of years of education and the odds of subscribing to a term deposit.
month : May has by far the highest contact volume (33.4% of calls) but also the worst ratio of subscribers to persons contacted (6.4%). Surprisingly, every month with a low frequency of contact (March, September, October and December) shows good results, with subscription rates between 43.9% and 50.5%.
contact : Customers contacted by cellular subscribed far more often (14.7%) than those contacted by telephone (5.2%).
day of week : Thursday tends to show the best results (12.1% of the calls made that day led to a subscription), but the differences between weekdays are small.