1. Introduction

Bank direct marketing is an interactive process of building beneficial relationships among stakeholders. Effective multichannel communication involves the study of customer characteristics and behavior. Apart from profit growth, which may raise customer loyalty and positive responses, the goal of bank direct marketing is to increase the response rates of direct promotion campaigns.

The use of data visualization by decision makers and their organizations offers many benefits, including the ability to absorb information in new and constructive ways. Visualizing relationships and patterns between operational and business activities can help identify and act on emerging trends. Visualization also enables users to manipulate and interact with data directly and fosters a new business language to tell the most relevant story. The choice of a proper visualization technique depends on many factors, such as the type of data (numerical or categorical), the nature of the domain of interest, and the final visualization purpose, which may involve plotting the distribution of data points or comparing different attributes over the same data point.

2. Goals & Objective

Goal
Create an exploratory data analysis that leads from data to a derived strategy. A strategy in which the company can identify the customers who are more likely to subscribe is desirable, as it would allow greater focus on the customers most likely to generate a sale.

Objective

Create an exploratory data analysis using R.

3. Methodology

First, we import the data and run a quick check of the data set we are using.

Next, a preprocessing phase checks how balanced the data distribution is and how many values are missing ("unknown").

After that, we transform the data as needed for each related business question.

4. Exploratory Data Analysis

Read Library

library(flexdashboard)

## Warning: package 'flexdashboard' was built under R version 3.6.3

library(tidyverse)

## -- Attaching packages ------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'readr' was built under R version 3.6.3

## -- Conflicts ---------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(glue)

## 
## Attaching package: 'glue'

## The following object is masked from 'package:dplyr':
## 
##     collapse

library(ggmosaic)

## Warning: package 'ggmosaic' was built under R version 3.6.3

library(gmodels) # For crosstable analysis

## Warning: package 'gmodels' was built under R version 3.6.3

library(readr)

Read Data

bank_data = read.csv(file = "bank-additional-full.csv",
                     sep = ";",
                     stringsAsFactors = F)

bank_data = bank_data %>% 
  mutate(y = factor(if_else(y == "yes", "1", "0"), 
                    levels = c("0", "1")))

bank_data$job <- as.factor(bank_data$job)
CrossTable(bank_data$y)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41188 
## 
##  
##           |         0 |         1 | 
##           |-----------|-----------|
##           |     36548 |      4640 | 
##           |     0.887 |     0.113 | 
##           |-----------|-----------|
## 
## 
## 
##

This is an unbalanced two-level categorical variable: 88.7% of the values are "no" (recoded as "0") and only 11.3% are "yes" (recoded as "1"). The 0/1 coding applied above is more natural for a dependent variable.
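The class shares printed by CrossTable can also be obtained with base R. The sketch below uses a small made-up vector (`y_toy` is hypothetical, not the bank data) to show the idea:

```r
# Toy target vector (hypothetical values, not the bank data)
y_toy <- factor(c("0", "0", "0", "0", "1", "0", "0", "0", "1", "0"),
                levels = c("0", "1"))

# Absolute counts per class
class_counts <- table(y_toy)

# Relative frequencies per class
class_props <- prop.table(class_counts)

print(class_counts)
print(round(class_props, 3))
```

Applied to `bank_data$y`, the same two calls should give the counts and shares shown in the table above.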

Finding out which variables have the most missing ("unknown") values

bank_data %>% 
  summarise_all(list(~sum(. == "unknown"))) %>% 
  gather(key = "variable", value = "nr_unknown") %>% 
  arrange(-nr_unknown)

##          variable nr_unknown
## 1         default       8597
## 2       education       1731
## 3         housing        990
## 4            loan        990
## 5             job        330
## 6         marital         80
## 7             age          0
## 8         contact          0
## 9           month          0
## 10    day_of_week          0
## 11       duration          0
## 12       campaign          0
## 13          pdays          0
## 14       previous          0
## 15       poutcome          0
## 16   emp.var.rate          0
## 17 cons.price.idx          0
## 18  cons.conf.idx          0
## 19      euribor3m          0
## 20    nr.employed          0
## 21              y          0
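A common follow-up, sketched here under the assumption that "unknown" should be treated as a missing value, is to convert it to NA so that standard tools such as `is.na()` apply. The data frame `toy` is made up for illustration and is not the bank data.

```r
# Toy data frame (hypothetical values) using "unknown" as a missing-value marker
toy <- data.frame(
  job     = c("admin.", "unknown", "student", "retired"),
  marital = c("single", "married", "unknown", "unknown"),
  age     = c(25, 41, 19, 67),
  stringsAsFactors = FALSE
)

# Replace the "unknown" marker with NA in character columns only
toy_na <- toy
is_chr <- vapply(toy_na, is.character, logical(1))
toy_na[is_chr] <- lapply(toy_na[is_chr], function(col) {
  col[col == "unknown"] <- NA
  col
})

# Count missing values per column, largest first
nr_missing <- sort(colSums(is.na(toy_na)), decreasing = TRUE)
print(nr_missing)
```

Whether "unknown" is better kept as its own category or treated as missing depends on the variable; the sections below handle it case by case.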

#Function

# default theme for ggplot
theme_set(theme_bw())


#Theme Algoritma
theme_algoritma <- theme(legend.key = element_rect(fill="black"),
           legend.background = element_rect(color="white", fill="#263238"),
           plot.subtitle = element_text(size=6, color="white"),
           panel.background = element_rect(fill="#dddddd"),
           panel.border = element_rect(fill=NA),
           panel.grid.minor.x = element_blank(),
           panel.grid.major.x = element_blank(),
           panel.grid.major.y = element_line(color="darkgrey", linetype=2),
           panel.grid.minor.y = element_blank(),
           plot.background = element_rect(fill="#263238"),
           text = element_text(color="white"),
           axis.text = element_text(color="white")
          
           )

# setting default parameters for mosaic plots
mosaic_theme = theme(axis.text.x = element_text(angle = 90,
                                                hjust = 1,
                                                vjust = 0.5),
                     axis.text.y = element_blank(),
                     axis.ticks.y = element_blank())

# setting default parameters for crosstables
fun_crosstable = function(df, var1, var2){
  # df: dataframe containing both columns to cross
  # var1, var2: columns to cross together.
  CrossTable(df[, var1], df[, var2],
             prop.r = T,
             prop.c = F,
             prop.t = F,
             prop.chisq = F,
             dnn = c(var1, var2))
}
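The row proportions that fun_crosstable prints (`prop.r = T`) correspond to `prop.table()` with `margin = 1` in base R. A minimal sketch on made-up data (`toy` is hypothetical, not the bank data):

```r
# Toy data (hypothetical values, not the bank data)
toy <- data.frame(
  marital = c("single", "single", "married", "married", "married", "divorced"),
  y       = c("1", "0", "0", "0", "1", "0")
)

# Contingency table of counts: rows = marital, columns = y
counts <- table(toy$marital, toy$y, dnn = c("marital", "y"))

# Row proportions: each row sums to 1, like "N / Row Total" in CrossTable
row_props <- prop.table(counts, margin = 1)

print(counts)
print(round(row_props, 3))
```

Reading the table row by row answers the question used throughout this report: within one category, what share of customers subscribed?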

a. Age

Customer Age Profile

plotage <- bank_data %>% 
  ggplot() +
  aes(x = age) +
  geom_bar() +
  geom_vline(xintercept = c(30, 60), 
             col = "red",
             linetype = "dashed") +
  facet_grid(y ~ .,
             scales = "free_y") +
  scale_x_continuous(breaks = seq(0, 100, 5))



ggplotly(plotage)

Plot Analysis by Age

# Bin age into three groups: low (30 or younger), mid (31 to 60), high (over 60)
bank_data_age = bank_data %>% 
  mutate(age = if_else(age > 60, "high", if_else(age > 30, "mid", "low")))

# Mosaic plot of subscription outcome by age group
plot_age <- ggplot(bank_data_age) +
  geom_mosaic(aes(x = product(y, age), fill = y)) +
  xlab("age") +
  ylab(NULL)

ggplotly(plot_age)

45.5% of people over 60 years old subscribed to a term deposit, which is a lot in comparison with younger individuals (15.2% for young adults (aged 30 or younger) and only 9.4% for the remaining observations (aged 31 to 60)).
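The nested `if_else()` binning above can also be written with base R's `cut()`, which makes the bin edges explicit. The sketch below uses made-up ages and outcomes (`toy` is hypothetical) with the same thresholds as the code above (30 and 60):

```r
# Toy data (hypothetical values, not the bank data)
toy <- data.frame(
  age = c(22, 30, 31, 45, 60, 61, 75, 28),
  y   = c(1, 0, 0, 0, 1, 1, 1, 0)
)

# Right-closed intervals: (-Inf, 30] = low, (30, 60] = mid, (60, Inf] = high
toy$age_group <- cut(toy$age,
                     breaks = c(-Inf, 30, 60, Inf),
                     labels = c("low", "mid", "high"))

# Share of subscribers (y == 1) within each age group
rate_by_group <- tapply(toy$y, toy$age_group, mean)
print(round(rate_by_group, 3))
```

Because `cut()` returns a factor with the levels in the order given, the groups plot as low, mid, high rather than alphabetically.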

b. Job

Customer Job Distribution Profile

#Count Customer Job Frequency

bankjob <-count(bank_data,bank_data$job)

#Count yes or no frequency based on jobs
bank_job_yesno <- count(bank_data,bank_data$job,bank_data$y)

# Convert long to wide: one row per job, one column per outcome

bank_job_wide <- pivot_wider(data = bank_job_yesno, names_from = `bank_data$y`, values_from = n)

# Assigning bank job frequency to bank_job_wide

bank_job_wide <- bank_job_wide %>%
  mutate(freq = bankjob$n)


names(bank_job_wide)[1] <- "jobs"
names(bank_job_wide)[4] <- "frequency"

# Create distribution of customers by job
bank_job_wide$jobs <- reorder(bank_job_wide$jobs, -bank_job_wide$frequency)
plotjobs <- ggplot(bank_job_wide,aes(`jobs`, frequency))+
  geom_col((aes(fill = jobs))) +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

ggplotly(plotjobs)

Plot Analysis by job

### Distribution of Customer

names(bank_job_wide)[2] <- "no"
names(bank_job_wide)[3] <- "yes"

bank_job_wide <- bank_job_wide %>%
  mutate(total = no + yes) %>%
  mutate(percentage_no = no / total * 100) %>%
  mutate(percentage_yes = yes / total * 100)

# bank_job_wide$total <- bank_job_wide$no + bank_job_wide$yes
# bank_job_wide$percentage_no <- bank_job_wide$no/bank_job_wide$total*100
# bank_job_wide$percentage_yes <- bank_job_wide$yes/bank_job_wide$total*100


plotjobs2<- ggplot(bank_job_wide, aes(`jobs`, percentage_yes)) +
  geom_point(aes(color = jobs, size = frequency)) + 
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
   labs( title = "Distribution of customer by jobs",
        x = "Jobs", 
        y = "yes Percentage",
        color = "jobs", size = "frequency") + 
        theme(text = element_text(face = "bold"))

ggplotly(plotjobs2)

plot_job_2 <- bank_data %>% 
  ggplot() +
  geom_mosaic(aes(x = product(y,job), fill = (y))) +
  mosaic_theme

ggplotly(plot_job_2)

Even though admin and blue-collar customers received the most calls, we can see that students and retired people are the most likely to subscribe.

Surprisingly, the student (31.4%), retired (25.2%) and unemployed (14.2%) categories show the best relative frequencies of term deposit subscription.
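The per-job subscription percentages can also be computed in one step with `group_by()` and `summarise()`, without the count, pivot and rename sequence. This is a sketch on made-up data (`toy` is hypothetical), not the pipeline used above:

```r
library(dplyr)

# Toy data (hypothetical values, not the bank data)
toy <- data.frame(
  job = c("student", "student", "admin.", "admin.", "admin.", "retired"),
  y   = c("1", "0", "0", "0", "1", "1")
)

job_summary <- toy %>%
  group_by(job) %>%
  summarise(
    frequency      = n(),                   # number of customers in the job
    yes            = sum(y == "1"),         # number of subscribers
    percentage_yes = yes / frequency * 100  # subscription rate in percent
  ) %>%
  arrange(desc(percentage_yes))

print(job_summary)
```

Keeping `frequency` next to `percentage_yes` matters for interpretation: a high rate in a small group is weaker evidence than the same rate in a large one.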

c. Marital

Customer Marital Distribution Profile

fun_crosstable(bank_data, "marital", "y")

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41188 
## 
##  
##              | y 
##      marital |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##     divorced |      4136 |       476 |      4612 | 
##              |     0.897 |     0.103 |     0.112 | 
## -------------|-----------|-----------|-----------|
##      married |     22396 |      2532 |     24928 | 
##              |     0.898 |     0.102 |     0.605 | 
## -------------|-----------|-----------|-----------|
##       single |      9948 |      1620 |     11568 | 
##              |     0.860 |     0.140 |     0.281 | 
## -------------|-----------|-----------|-----------|
##      unknown |        68 |        12 |        80 | 
##              |     0.850 |     0.150 |     0.002 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36548 |      4640 |     41188 | 
## -------------|-----------|-----------|-----------|
## 
##

bank_data = bank_data %>% 
  filter(marital != "unknown")

Customer Marital Profile

# Count customers per marital status

bank_marital <-count(bank_data,bank_data$marital)

# Count subscribers ("1") and non-subscribers ("0") per marital status
bank_marital_yesno <- count(bank_data,bank_data$marital,bank_data$y)

# Convert long to wide: one row per marital status, one column per outcome

bank_marital_wide <- pivot_wider(data = bank_marital_yesno, names_from = `bank_data$y`, values_from = n)

# Attach the total number of customers per marital status

bank_marital_wide <- bank_marital_wide %>%
  mutate(freq = bank_marital$n)

names(bank_marital_wide)[1] <- "marital"
names(bank_marital_wide)[4] <- "frequency"

# Create distribution of customers by marital status
plotmarital <- ggplot(bank_marital_wide,aes(`marital`, frequency))+
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust=1))


plotmaritala <- bank_data %>% 
  ggplot() +
  aes(x = marital, y = ..count../nrow(bank_data), fill = y) +
  geom_bar() +
  ylab("relative frequency")

ggplotly(plotmaritala)

Plot Analysis by Marital

# Mosaic plot of subscription outcome by marital status
plotmarital3<- bank_data %>% 
  ggplot()  +
  geom_mosaic(aes(x = product(y, marital), fill = y)) +
  mosaic_theme +
  xlab("Marital status") +
  ylab(NULL)

# No "text" aesthetic is defined for this plot, so the default tooltip is used
ggplotly(plotmarital3)

### Distribution of Customer

names(bank_marital_wide)[2] <- "no"
names(bank_marital_wide)[3] <- "yes"

bank_marital_wide <- bank_marital_wide %>%
  mutate(total = no + yes) %>%
  mutate(percentage_no = no / total * 100) %>%
  mutate(percentage_yes = yes / total * 100)

bank_marital_long <- bank_marital_wide %>% 
  pivot_longer(cols = c(percentage_yes, percentage_no),
               names_to = "percentage", 
               values_to = "value" ) %>% 
  mutate(text = glue(
    "Marital status : {marital}
     Total : {frequency}
     percentage : {round(value,2)}%"))


plotmarital3 <- ggplot(bank_marital_long, aes(x = marital, 
                                 y = value,
                                 text = text)) +
  geom_col(aes(fill = percentage), position = "dodge") +
  coord_flip() +
  labs(x = NULL,
       y = NULL,
       title = "Marital Distribution of Customer") +
  theme(legend.position = "none") +
  theme_algoritma  

ggplotly(plotmarital3, tooltip = "text")

# plotmarital3 <- ggplot(bank_marital_long, aes(x = marital,
#                                  y = value)) +
#   geom_point(aes(color = marital, size = frequency, position = "dodge")) +
#   coord_flip() +
#   labs(x = NULL,
#        y = NULL,
#        title = "Marital Distribution of Customer") +
#   theme(legend.position = "none") +
#   theme_algoritma
# 
# ggplotly(plotmarital3)
 

plotmarital2<- ggplot(bank_marital_wide, aes(`marital`, percentage_yes)) +
  geom_point(aes(color = marital, size = frequency)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
   labs( title = "Distribution of customer by marital",
        x = "Marital",
        y = "yes Percentage",
        color = "marital", size = "frequency") +
        theme(text = element_text(face = "bold"))

ggplotly(plotmarital2)

From the plot, we can conclude that single customers subscribe slightly more often (14.0%) than divorced (10.3%) or married (10.2%) customers.

d. Education

Customer Education Distribution Profile

fun_crosstable(bank_data, "education", "y")

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41108 
## 
##  
##                     | y 
##           education |         0 |         1 | Row Total | 
## --------------------|-----------|-----------|-----------|
##            basic.4y |      3743 |       427 |      4170 | 
##                     |     0.898 |     0.102 |     0.101 | 
## --------------------|-----------|-----------|-----------|
##            basic.6y |      2098 |       188 |      2286 | 
##                     |     0.918 |     0.082 |     0.056 | 
## --------------------|-----------|-----------|-----------|
##            basic.9y |      5566 |       471 |      6037 | 
##                     |     0.922 |     0.078 |     0.147 | 
## --------------------|-----------|-----------|-----------|
##         high.school |      8471 |      1030 |      9501 | 
##                     |     0.892 |     0.108 |     0.231 | 
## --------------------|-----------|-----------|-----------|
##          illiterate |        14 |         4 |        18 | 
##                     |     0.778 |     0.222 |     0.000 | 
## --------------------|-----------|-----------|-----------|
## professional.course |      4642 |       595 |      5237 | 
##                     |     0.886 |     0.114 |     0.127 | 
## --------------------|-----------|-----------|-----------|
##   university.degree |     10473 |      1664 |     12137 | 
##                     |     0.863 |     0.137 |     0.295 | 
## --------------------|-----------|-----------|-----------|
##             unknown |      1473 |       249 |      1722 | 
##                     |     0.855 |     0.145 |     0.042 | 
## --------------------|-----------|-----------|-----------|
##        Column Total |     36480 |      4628 |     41108 | 
## --------------------|-----------|-----------|-----------|
## 
##

bank_data = bank_data %>% 
  filter(education != "illiterate")

bank_data = bank_data %>% 
  mutate(education = recode(education, "unknown" = "university.degree"))

# Count customers per education level

bank_education <-count(bank_data,bank_data$education)

# Count subscribers ("1") and non-subscribers ("0") per education level
bank_education_yesno <- count(bank_data,bank_data$education,bank_data$y)

# Convert long to wide: one row per education level, one column per outcome

bank_education_wide <- pivot_wider(data = bank_education_yesno, names_from = `bank_data$y`, values_from = n)

# Attach the total number of customers per education level

bank_education_wide <- bank_education_wide %>%
  mutate(freq = bank_education$n)

names(bank_education_wide)[1] <- "education"
names(bank_education_wide)[4] <- "frequency"

# Create distribution of customers by education level
ploteducation <- ggplot(bank_education_wide,aes(`education`, frequency))+
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

ggplotly(ploteducation)

Plot Analysis by Education

bank_data %>% 
  ggplot() +
  geom_mosaic(aes(x = product(y, education), fill = y)) +
  mosaic_theme +
  xlab("Education") +
  ylab(NULL)

### Distribution of Customer

names(bank_education_wide)[2] <- "no"
names(bank_education_wide)[3] <- "yes"

bank_education_wide <- bank_education_wide %>%
  mutate(total = no + yes) %>%
  mutate(percentage_no = no / total * 100) %>%
  mutate(percentage_yes = yes / total * 100)

bank_education_long <- pivot_longer(data = bank_education_wide, 
                             cols = c(percentage_yes, percentage_no),
                             names_to = "percentage", 
                             values_to = "value" )


ploteducation3 <- ggplot(bank_education_long, aes(x = education, 
                                 y = value)) +
  geom_col(aes(fill = percentage), position = "fill") +
  coord_flip() +
  labs(x = NULL,
       y = NULL,
       title = "Education Distribution of Customer") +
  theme(legend.position = "none") +
  theme_algoritma  

ggplotly(ploteducation3)


ploteducation2<- ggplot(bank_education_wide, aes(`education`, percentage_yes)) +
  geom_point(aes(color = education, size = frequency)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
   labs( title = "Distribution of customer by education",
        x = "Education",
        y = "yes Percentage",
        color = "education", size = "frequency") +
        theme(text = element_text(face = "bold"))

ggplotly(ploteducation2)

Broadly, we can see that higher education levels are associated with a higher probability of subscription.

e. Default

Does the client have a credit in default?

fun_crosstable(bank_data, "default", "y")

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41090 
## 
##  
##              | y 
##      default |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     28326 |      4182 |     32508 | 
##              |     0.871 |     0.129 |     0.791 | 
## -------------|-----------|-----------|-----------|
##      unknown |      8137 |       442 |      8579 | 
##              |     0.948 |     0.052 |     0.209 | 
## -------------|-----------|-----------|-----------|
##          yes |         3 |         0 |         3 | 
##              |     1.000 |     0.000 |     0.000 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36466 |      4624 |     41090 | 
## -------------|-----------|-----------|-----------|
## 
##

This feature is not usable: only 3 customers have a recorded credit default ("yes"), and 20.9% of the values are unknown.

bank_data = bank_data %>% 
  select(-default)

f. Housing

fun_crosstable(bank_data, "housing", "y")

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41090 
## 
##  
##              | y 
##      housing |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     16552 |      2018 |     18570 | 
##              |     0.891 |     0.109 |     0.452 | 
## -------------|-----------|-----------|-----------|
##      unknown |       882 |       107 |       989 | 
##              |     0.892 |     0.108 |     0.024 | 
## -------------|-----------|-----------|-----------|
##          yes |     19032 |      2499 |     21531 | 
##              |     0.884 |     0.116 |     0.524 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36466 |      4624 |     41090 | 
## -------------|-----------|-----------|-----------|
## 
##

chisq.test(bank_data$housing, bank_data$y)

## 
##  Pearson's Chi-squared test
## 
## data:  bank_data$housing and bank_data$y
## X-squared = 5.6515, df = 2, p-value = 0.05926

Since the p-value (0.059) is above 0.05, we cannot reject the hypothesis of independence at the 5% significance level: there is no significant evidence of an association between the dependent variable y and the housing feature.
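The test can be reproduced from the counts in the housing crosstable above, without the raw data. The sketch below enters those counts as a matrix; the object names are illustrative.

```r
# Counts copied from the housing crosstable above (rows: housing, columns: y)
housing_counts <- matrix(
  c(16552, 2018,
      882,  107,
    19032, 2499),
  nrow = 3, byrow = TRUE,
  dimnames = list(housing = c("no", "unknown", "yes"),
                  y       = c("0", "1"))
)

# Pearson chi-squared test of independence on the 3 x 2 table
housing_test <- chisq.test(housing_counts)
print(housing_test)

# Decision at the 5% significance level
alpha <- 0.05
reject_independence <- housing_test$p.value < alpha
print(reject_independence)
```

The statistic and p-value should match the output reported above. Since the p-value sits only just above 0.05, the decision is sensitive to the chosen threshold.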

bank_data = bank_data %>% 
  select(-housing)

g. Contact

How was the client contacted?

fun_crosstable(bank_data, "contact", "y")

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41090 
## 
##  
##              | y 
##      contact |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##     cellular |     22237 |      3839 |     26076 | 
##              |     0.853 |     0.147 |     0.635 | 
## -------------|-----------|-----------|-----------|
##    telephone |     14229 |       785 |     15014 | 
##              |     0.948 |     0.052 |     0.365 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36466 |      4624 |     41090 | 
## -------------|-----------|-----------|-----------|
## 
##

This feature is really interesting: 14.7% of customers contacted by cellular subscribed to a term deposit, while only 5.2% of those contacted by telephone did.
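One way to check that this gap is larger than sampling noise is a two-sample test of proportions on the counts from the contact crosstable above. This check is not part of the original analysis, and the object names are illustrative.

```r
# Subscribers and totals copied from the contact crosstable above
subscribed <- c(cellular = 3839, telephone = 785)
contacted  <- c(cellular = 26076, telephone = 15014)

# Observed subscription rates per contact channel
rates <- subscribed / contacted
print(round(rates, 3))

# Two-sample test for equality of proportions (base R, stats package)
contact_test <- prop.test(x = subscribed, n = contacted)
print(contact_test$p.value)
```

A small p-value here only says the two rates differ; it does not say why, since contact channel may be linked to other variables such as the month of the call.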

h. Month

Customer Month Distribution Profile

month_recode = c("jan" = "(01)jan",
                 "feb" = "(02)feb",
                 "mar" = "(03)mar",
                 "apr" = "(04)apr",
                 "may" = "(05)may",
                 "jun" = "(06)jun",
                 "jul" = "(07)jul",
                 "aug" = "(08)aug",
                 "sep" = "(09)sep",
                 "oct" = "(10)oct",
                 "nov" = "(11)nov",
                 "dec" = "(12)dec")

bank_data = bank_data %>% 
  mutate(month = recode(month, !!!month_recode))

fun_crosstable(bank_data, "month", "y")

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41090 
## 
##  
##              | y 
##        month |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##      (03)mar |       269 |       274 |       543 | 
##              |     0.495 |     0.505 |     0.013 | 
## -------------|-----------|-----------|-----------|
##      (04)apr |      2089 |       538 |      2627 | 
##              |     0.795 |     0.205 |     0.064 | 
## -------------|-----------|-----------|-----------|
##      (05)may |     12849 |       884 |     13733 | 
##              |     0.936 |     0.064 |     0.334 | 
## -------------|-----------|-----------|-----------|
##      (06)jun |      4748 |       558 |      5306 | 
##              |     0.895 |     0.105 |     0.129 | 
## -------------|-----------|-----------|-----------|
##      (07)jul |      6513 |       647 |      7160 | 
##              |     0.910 |     0.090 |     0.174 | 
## -------------|-----------|-----------|-----------|
##      (08)aug |      5514 |       649 |      6163 | 
##              |     0.895 |     0.105 |     0.150 | 
## -------------|-----------|-----------|-----------|
##      (09)sep |       314 |       256 |       570 | 
##              |     0.551 |     0.449 |     0.014 | 
## -------------|-----------|-----------|-----------|
##      (10)oct |       401 |       314 |       715 | 
##              |     0.561 |     0.439 |     0.017 | 
## -------------|-----------|-----------|-----------|
##      (11)nov |      3676 |       415 |      4091 | 
##              |     0.899 |     0.101 |     0.100 | 
## -------------|-----------|-----------|-----------|
##      (12)dec |        93 |        89 |       182 | 
##              |     0.511 |     0.489 |     0.004 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36466 |      4624 |     41090 | 
## -------------|-----------|-----------|-----------|
## 
##

plotmonth<- bank_data %>% 
  ggplot() +
  aes(x = month, y = ..count../nrow(bank_data), fill = y) +
  geom_bar() +
  ylab("relative frequency")

ggplotly(plotmonth)

First of all, we can notice that no contact was made during January and February. The highest spike occurs during May, with 33.4% of total contacts, but it has the worst ratio of subscribers to persons contacted (6.4%). Every month with a very low frequency of contact (March, September, October and December) shows very good results (between 44% and 51% of subscribers). December aside, there are enough observations to conclude this isn't pure luck, so this feature will probably be very important in models.
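The numeric prefixes such as "(03)mar" exist only to force chronological ordering in tables and on plot axes. An alternative sketch, assuming the month column holds lower-case three-letter abbreviations as in this data set, is a factor with explicit levels (the vector `months_toy` is made up):

```r
# Toy month values (hypothetical, in arbitrary order)
months_toy <- c("may", "mar", "dec", "may", "sep", "apr")

# month.abb is a base R constant: "Jan", "Feb", ..., "Dec"
month_levels <- tolower(month.abb)

# Factor whose levels follow calendar order instead of alphabetical order
months_fct <- factor(months_toy, levels = month_levels)

# table() keeps the level order; drop months that never occur
month_counts <- table(droplevels(months_fct))
print(month_counts)
```

This keeps the original labels readable while ggplot2 and `table()` still respect calendar order.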

# Count contacts per month

bank_month <-count(bank_data,bank_data$month)

# Count subscribers ("1") and non-subscribers ("0") per month
bank_month_yesno <- count(bank_data,bank_data$month,bank_data$y)

# Convert long to wide: one row per month, one column per outcome

bank_month_wide <- pivot_wider(data = bank_month_yesno, names_from = `bank_data$y`, values_from = n)

# Attach the total number of contacts per month

bank_month_wide <- bank_month_wide %>%
  mutate(freq = bank_month$n)

names(bank_month_wide)[1] <- "month"
names(bank_month_wide)[4] <- "frequency"

# Create distribution of customers by month
plotmonth <- ggplot(bank_month_wide,aes(`month`, frequency))+
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

Plot Analysis by month

### Distribution of Customer

names(bank_month_wide)[2] <- "no"
names(bank_month_wide)[3] <- "yes"

bank_month_wide <- bank_month_wide %>%
  mutate(total = no + yes) %>%
  mutate(percentage_no = no / total * 100) %>%
  mutate(percentage_yes = yes / total * 100)

bank_month_long <- pivot_longer(data = bank_month_wide, 
                             cols = c(percentage_yes, percentage_no),
                             names_to = "percentage", 
                             values_to = "value" )


plotmonth3 <- ggplot(bank_month_long, aes(x = month, 
                                 y = value)) +
  geom_col(aes(fill = percentage), position = "fill") +
  coord_flip() +
  labs(x = NULL,
       y = NULL,
       title = "Month Distribution of Customer") +
  theme(legend.position = "none") +
  theme_algoritma  

ggplotly(plotmonth3)


plotmonth2<- ggplot(bank_month_wide, aes(`month`, percentage_yes)) +
  geom_point(aes(color = month, size = frequency)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
   labs( title = "Distribution of customer by month",
        x = "Month",
        y = "yes Percentage",
        color = "month", size = "frequency") +
        theme(text = element_text(face = "bold"))

ggplotly(plotmonth2)

i. Day of the Week

Customer Day of the week distribution profile

day_recode = c("mon" = "(01)mon",
               "tue" = "(02)tue",
               "wed" = "(03)wed",
               "thu" = "(04)thu",
               "fri" = "(05)fri")

bank_data = bank_data %>% 
  mutate(day_of_week = recode(day_of_week, !!!day_recode))

fun_crosstable(bank_data, "day_of_week", "y")

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41090 
## 
##  
##              | y 
##  day_of_week |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##      (01)mon |      7650 |       844 |      8494 | 
##              |     0.901 |     0.099 |     0.207 | 
## -------------|-----------|-----------|-----------|
##      (02)tue |      7122 |       952 |      8074 | 
##              |     0.882 |     0.118 |     0.196 | 
## -------------|-----------|-----------|-----------|
##      (03)wed |      7174 |       944 |      8118 | 
##              |     0.884 |     0.116 |     0.198 | 
## -------------|-----------|-----------|-----------|
##      (04)thu |      7553 |      1040 |      8593 | 
##              |     0.879 |     0.121 |     0.209 | 
## -------------|-----------|-----------|-----------|
##      (05)fri |      6967 |       844 |      7811 | 
##              |     0.892 |     0.108 |     0.190 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36466 |      4624 |     41090 | 
## -------------|-----------|-----------|-----------|
## 
##

Calls aren't made on weekends. Although calls are spread evenly across the weekdays, Thursdays tend to show better results (12.1% of calls made that day led to a subscription), unlike Mondays with only 9.9% of successful calls. However, those differences are small, which makes this feature not that important. It would have been interesting to see how responders react to weekend calls.

Plot Analysis by Day of Week

plotdowa<- bank_data %>% 
  ggplot() +
  aes(x = day_of_week, y = ..count../nrow(bank_data), fill = y) +
  geom_bar() +
  ylab("relative frequency")

ggplotly(plotdowa)

# Count contacts per day of the week

bank_dow <-count(bank_data,bank_data$day_of_week)

# Count subscribers ("1") and non-subscribers ("0") per day of the week
bank_dow_yesno <- count(bank_data,bank_data$day_of_week,bank_data$y)

# Convert long to wide: one row per day, one column per outcome

bank_dow_wide <- pivot_wider(data = bank_dow_yesno, names_from = `bank_data$y`, values_from = n)

# Attach the total number of contacts per day of the week

bank_dow_wide <- bank_dow_wide %>%
  mutate(freq = bank_dow$n)

names(bank_dow_wide)[1] <- "day_of_week"
names(bank_dow_wide)[4] <- "frequency"

# Create distribution of customers by day of the week
plotdow <- ggplot(bank_dow_wide,aes(`day_of_week`, frequency))+
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

### Distribution of Customer

names(bank_dow_wide)[2] <- "no"
names(bank_dow_wide)[3] <- "yes"

bank_dow_wide <- bank_dow_wide %>%
  mutate(total = no + yes) %>%
  mutate(percentage_no = no / total * 100) %>%
  mutate(percentage_yes = yes / total * 100)

bank_dow_long <- pivot_longer(data = bank_dow_wide, 
                             cols = c(percentage_yes, percentage_no),
                             names_to = "percentage", 
                             values_to = "value" )

plotdow3 <- ggplot(bank_dow_long, aes(x = day_of_week, 
                                 y = value)) +
  geom_col(aes(fill = percentage), position = "fill") +
  coord_flip() +
  labs(x = NULL,
       y = NULL,
       title = "Day of week Distribution of Customer") +
  theme(legend.position = "none") +
  theme_algoritma  

ggplotly(plotdow3)


plotdow2<- ggplot(bank_dow_wide, aes(`day_of_week`, percentage_yes)) +
  geom_point(aes(color = day_of_week, size = frequency)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
   labs( title = "Distribution of customer by day of week",
        x = "Day of week",
        y = "yes Percentage",
        color = "day_of_week", size = "frequency") +
        theme(text = element_text(face = "bold"))

ggplotly(plotdow2)

5. Conclusion

From this exploratory analysis we have derived a lot of information from the data and visualized it to support a strategy in which the company can identify the customers who are more likely to subscribe and focus on those most likely to generate a sale.

age : 45.5% of people over 60 years old subscribed to a term deposit, which is a lot in comparison with younger individuals (15.2% for young adults (aged 30 or younger) and only 9.4% for the remaining observations (aged 31 to 60)).

jobs : Even though admin and blue-collar customers received the most calls, students and retired people are the most likely to subscribe. Surprisingly, the student (31.4%), retired (25.2%) and unemployed (14.2%) categories show the best relative frequencies of term deposit subscription.

marital : Single customers subscribe slightly more often (14.0%) to term deposits than others (divorced (10.3%) and married (10.2%)).

education : There appears to be a positive association between the level of education and the odds of subscribing to a term deposit.

contact : 14.7% of customers contacted by cellular subscribed to a term deposit, compared with only 5.2% of those contacted by telephone.

month : The highest spike occurs during May, but it also has the worst ratio of subscribers to persons contacted. Surprisingly, every month with a low frequency of contact (March, September, October and December) shows good results.

day of week : Thursdays tend to show slightly better results (12.1% of calls made that day led to a subscription).