1. Introduction

Bank direct marketing is an interactive process of building beneficial relationships among stakeholders. Effective multichannel communication involves the study of customer characteristics and behavior. Apart from profit growth, which may raise customer loyalty and positive responses, the goal of bank direct marketing is to increase the response rates of direct promotion campaigns.

Data visualization offers decision makers and their organizations many benefits, including the ability to absorb information in new and constructive ways. Visualizing relationships and patterns between operational and business activities can help identify and act on emerging trends. Visualization also enables users to manipulate and interact with data directly and fosters a new business language to tell the most relevant story. The choice of a proper visualization technique depends on many factors, such as the type of data (numerical or categorical), the nature of the domain of interest, and the final visualization purpose, which may involve plotting the distribution of data points or comparing different attributes over the same data point.

2. Goals & Objective

Goal
Create an exploratory data analysis that leads from the data to a derived strategy, so that the company can identify the customers who are more likely to subscribe and focus on those most likely to generate a sale.

Objective

Create an exploratory data analysis using R.

3. Methodology

First, we import the data and run a quick check on it.

Next, a preprocessing phase is applied to balance the data distribution.

After that, we transform the data as needed for the business question at hand.

4. Exploratory Data Analysis

Read Library

The report loads flexdashboard, tidyverse 1.3.0 (ggplot2 3.2.1, tibble 2.1.3, tidyr 1.0.2, readr 1.3.1, purrr 0.3.3, dplyr 0.8.4, stringr 1.4.0, forcats 0.4.0), plotly, glue, ggmosaic and gmodels. R warned that flexdashboard, readr, ggmosaic and gmodels were built under R version 3.6.3, and reported the usual masking: dplyr::filter() and dplyr::lag() mask the stats versions, plotly masks last_plot (ggplot2), filter (stats) and layout (graphics), and glue masks collapse (dplyr).

Read Data

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41188 
## 
##  
##           |         0 |         1 | 
##           |-----------|-----------|
##           |     36548 |      4640 | 
##           |     0.887 |     0.113 | 
##           |-----------|-----------|
## 
## 
## 
## 

The target y is an unbalanced two-level categorical variable: 88.7% of the values are “no” (or “0”) and only 11.3% are “yes” (or “1”). It is more natural to work with a 0/1 dependent variable, so y is recoded to 0/1 in the tables that follow.
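The recoding to a 0/1 dependent variable can be sketched as follows. This is a minimal illustration, not the report's own code: the vector `y_raw` and its values are hypothetical stand-ins for the raw target column.

```r
# Hypothetical toy vector standing in for the raw "yes"/"no" target column.
y_raw <- c("no", "no", "yes", "no", "yes")

# Recode to an integer 0/1 dependent variable.
y <- ifelse(y_raw == "yes", 1L, 0L)

# Class shares, analogous to the "N / Table Total" row in the table above.
prop.table(table(y))
```

On the real data the same two steps would be applied to the target column of the imported data frame.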

Finding out which variables have the most missing values (coded as “unknown”)

##          variable nr_unknown
## 1         default       8597
## 2       education       1731
## 3         housing        990
## 4            loan        990
## 5             job        330
## 6         marital         80
## 7             age          0
## 8         contact          0
## 9           month          0
## 10    day_of_week          0
## 11       duration          0
## 12       campaign          0
## 13          pdays          0
## 14       previous          0
## 15       poutcome          0
## 16   emp.var.rate          0
## 17 cons.price.idx          0
## 18  cons.conf.idx          0
## 19      euribor3m          0
## 20    nr.employed          0
## 21              y          0
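A summary like the one above can be produced by counting the "unknown" entries in each column. The sketch below uses base R on a small made-up data frame; the name `toy` and its values are hypothetical, and the report itself may have used tidyverse functions instead.

```r
# Hypothetical toy data frame; the real data set has 21 columns.
toy <- data.frame(
  job     = c("admin.", "unknown", "student", "retired"),
  default = c("unknown", "no", "unknown", "no"),
  age     = c(31, 45, 22, 67),
  stringsAsFactors = FALSE
)

# Count "unknown" entries per column (numeric columns simply yield 0).
nr_unknown <- sapply(toy, function(col) sum(col == "unknown", na.rm = TRUE))

# Arrange as a two-column table sorted from most to fewest unknowns.
unknown_summary <- data.frame(
  variable   = names(nr_unknown),
  nr_unknown = unname(nr_unknown),
  stringsAsFactors = FALSE
)
unknown_summary <- unknown_summary[order(-unknown_summary$nr_unknown), ]
print(unknown_summary, row.names = FALSE)
```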

a. Age

Customer Age Profile

Plot Analysis by Age

45.5% of people over 60 years old subscribed to a term deposit, which is a lot compared with younger individuals: 15.2% for young adults (aged under 30) and only 9.4% for the remaining observations (aged between 30 and 60).
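The three age bands used above (under 30, 30 to 60, over 60) can be built with `cut()` and summarised with `tapply()`. The data below are made up purely for illustration, so the resulting rates do not match the report's figures; the sketch also assumes ages are whole numbers.

```r
# Hypothetical toy data; values are illustrative only.
toy <- data.frame(
  age = c(24, 29, 35, 47, 58, 61, 66, 72),
  y   = c(0,  1,  0,  0,  0,  1,  0,  1)
)

# Three age bands: under 30, 30 to 60 inclusive, over 60.
# cut() uses right-closed intervals by default: (-Inf, 29], (29, 60], (60, Inf).
toy$age_group <- cut(
  toy$age,
  breaks = c(-Inf, 29, 60, Inf),
  labels = c("under 30", "30-60", "over 60")
)

# Subscription rate per band (the mean of a 0/1 variable is a proportion).
rate_by_age <- tapply(toy$y, toy$age_group, mean)
round(rate_by_age, 3)
```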

b. Job

Customer Job Distribution Profile

Plot Analysis by job

Even though admin and blue-collar customers received the most calls, those most likely to subscribe are students and retired people.

Surprisingly, the student (31.4%), retired (25.2%) and unemployed (14.2%) categories show the best relative frequencies of term deposit subscription.

c. Marital

Customer Marital Distribution Profile

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41188 
## 
##  
##              | y 
##      marital |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##     divorced |      4136 |       476 |      4612 | 
##              |     0.897 |     0.103 |     0.112 | 
## -------------|-----------|-----------|-----------|
##      married |     22396 |      2532 |     24928 | 
##              |     0.898 |     0.102 |     0.605 | 
## -------------|-----------|-----------|-----------|
##       single |      9948 |      1620 |     11568 | 
##              |     0.860 |     0.140 |     0.281 | 
## -------------|-----------|-----------|-----------|
##      unknown |        68 |        12 |        80 | 
##              |     0.850 |     0.150 |     0.002 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36548 |      4640 |     41188 | 
## -------------|-----------|-----------|-----------|
## 
## 

Customer Marital Profile

Plot Analysis by Marital

From the plot and the table above, we can conclude that single customers subscribe slightly more often (14.0%) than divorced (10.3%) and married (10.2%) customers.
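The row proportions in the marital table can be recomputed directly from its counts. The sketch below uses base R's `prop.table()` rather than the gmodels cross table shown in the output; the matrix name `marital` is only an illustrative choice.

```r
# Counts copied from the marital cross table above (rows: marital, columns: y).
marital <- matrix(
  c( 4136,  476,
    22396, 2532,
     9948, 1620,
       68,   12),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    marital = c("divorced", "married", "single", "unknown"),
    y       = c("0", "1")
  )
)

# Row proportions: share of non-subscribers and subscribers within each status.
row_share <- round(prop.table(marital, margin = 1), 3)
row_share
```

Setting `margin = 1` normalises within each row, which is what the "N / Row Total" line of the cross table reports.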

d. Education

Customer Education Distribution Profile

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41108 
## 
##  
##                     | y 
##           education |         0 |         1 | Row Total | 
## --------------------|-----------|-----------|-----------|
##            basic.4y |      3743 |       427 |      4170 | 
##                     |     0.898 |     0.102 |     0.101 | 
## --------------------|-----------|-----------|-----------|
##            basic.6y |      2098 |       188 |      2286 | 
##                     |     0.918 |     0.082 |     0.056 | 
## --------------------|-----------|-----------|-----------|
##            basic.9y |      5566 |       471 |      6037 | 
##                     |     0.922 |     0.078 |     0.147 | 
## --------------------|-----------|-----------|-----------|
##         high.school |      8471 |      1030 |      9501 | 
##                     |     0.892 |     0.108 |     0.231 | 
## --------------------|-----------|-----------|-----------|
##          illiterate |        14 |         4 |        18 | 
##                     |     0.778 |     0.222 |     0.000 | 
## --------------------|-----------|-----------|-----------|
## professional.course |      4642 |       595 |      5237 | 
##                     |     0.886 |     0.114 |     0.127 | 
## --------------------|-----------|-----------|-----------|
##   university.degree |     10473 |      1664 |     12137 | 
##                     |     0.863 |     0.137 |     0.295 | 
## --------------------|-----------|-----------|-----------|
##             unknown |      1473 |       249 |      1722 | 
##                     |     0.855 |     0.145 |     0.042 | 
## --------------------|-----------|-----------|-----------|
##        Column Total |     36480 |      4628 |     41108 | 
## --------------------|-----------|-----------|-----------|
## 
## 

Plot Analysis by Education

We can see a broadly positive association between education level and the probability of subscription.

e. Default

Does the client have a credit in default?

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41090 
## 
##  
##              | y 
##      default |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     28326 |      4182 |     32508 | 
##              |     0.871 |     0.129 |     0.791 | 
## -------------|-----------|-----------|-----------|
##      unknown |      8137 |       442 |      8579 | 
##              |     0.948 |     0.052 |     0.209 | 
## -------------|-----------|-----------|-----------|
##          yes |         3 |         0 |         3 | 
##              |     1.000 |     0.000 |     0.000 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36466 |      4624 |     41090 | 
## -------------|-----------|-----------|-----------|
## 
## 

This feature is almost certainly not usable, because only 3 clients answered “yes”.

f. Housing

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41090 
## 
##  
##              | y 
##      housing |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##           no |     16552 |      2018 |     18570 | 
##              |     0.891 |     0.109 |     0.452 | 
## -------------|-----------|-----------|-----------|
##      unknown |       882 |       107 |       989 | 
##              |     0.892 |     0.108 |     0.024 | 
## -------------|-----------|-----------|-----------|
##          yes |     19032 |      2499 |     21531 | 
##              |     0.884 |     0.116 |     0.524 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36466 |      4624 |     41090 | 
## -------------|-----------|-----------|-----------|
## 
## 
## 
##  Pearson's Chi-squared test
## 
## data:  bank_data$housing and bank_data$y
## X-squared = 5.6515, df = 2, p-value = 0.05926

Since the p-value (0.059) is above 0.05, we cannot reject the hypothesis of independence at the 5% significance level: the data show no significant association between the dependent variable y and our feature housing.
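The test can also be run on the contingency table itself. The sketch below assumes the reported test used the same 41,090 rows as the housing table, in which case `chisq.test()` on the counts should give a statistic close to the one shown above; the object names are illustrative.

```r
# Counts copied from the housing cross table above (rows: housing, columns: y).
housing <- matrix(
  c(16552, 2018,
      882,  107,
    19032, 2499),
  ncol = 2, byrow = TRUE,
  dimnames = list(housing = c("no", "unknown", "yes"), y = c("0", "1"))
)

# Pearson's chi-squared test of independence (no continuity correction
# is applied because the table is larger than 2 x 2).
res <- chisq.test(housing)
res$statistic   # test statistic
res$parameter   # degrees of freedom
res$p.value

# Decision at the 5% significance level.
reject_independence <- res$p.value < 0.05
reject_independence
```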

g. Contact

How was the client contacted?

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41090 
## 
##  
##              | y 
##      contact |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##     cellular |     22237 |      3839 |     26076 | 
##              |     0.853 |     0.147 |     0.635 | 
## -------------|-----------|-----------|-----------|
##    telephone |     14229 |       785 |     15014 | 
##              |     0.948 |     0.052 |     0.365 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36466 |      4624 |     41090 | 
## -------------|-----------|-----------|-----------|
## 
## 

This feature is really interesting: 14.7% of clients contacted by cellular phone subscribed to a term deposit, while only 5.2% of those contacted by telephone did.

h. Month

Customer Month Distribution Profile

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41090 
## 
##  
##              | y 
##        month |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##      (03)mar |       269 |       274 |       543 | 
##              |     0.495 |     0.505 |     0.013 | 
## -------------|-----------|-----------|-----------|
##      (04)apr |      2089 |       538 |      2627 | 
##              |     0.795 |     0.205 |     0.064 | 
## -------------|-----------|-----------|-----------|
##      (05)may |     12849 |       884 |     13733 | 
##              |     0.936 |     0.064 |     0.334 | 
## -------------|-----------|-----------|-----------|
##      (06)jun |      4748 |       558 |      5306 | 
##              |     0.895 |     0.105 |     0.129 | 
## -------------|-----------|-----------|-----------|
##      (07)jul |      6513 |       647 |      7160 | 
##              |     0.910 |     0.090 |     0.174 | 
## -------------|-----------|-----------|-----------|
##      (08)aug |      5514 |       649 |      6163 | 
##              |     0.895 |     0.105 |     0.150 | 
## -------------|-----------|-----------|-----------|
##      (09)sep |       314 |       256 |       570 | 
##              |     0.551 |     0.449 |     0.014 | 
## -------------|-----------|-----------|-----------|
##      (10)oct |       401 |       314 |       715 | 
##              |     0.561 |     0.439 |     0.017 | 
## -------------|-----------|-----------|-----------|
##      (11)nov |      3676 |       415 |      4091 | 
##              |     0.899 |     0.101 |     0.100 | 
## -------------|-----------|-----------|-----------|
##      (12)dec |        93 |        89 |       182 | 
##              |     0.511 |     0.489 |     0.004 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36466 |      4624 |     41090 | 
## -------------|-----------|-----------|-----------|
## 
## 

First of all, we can notice that no contact was made during January and February. The highest spike occurs during May, with 33.4% of total contacts, but May has the worst ratio of subscribers to persons contacted (6.4%). Every month with a very low frequency of contact (March, September, October and December) shows very good results (between 44% and 51% of subscribers). December aside, there are enough observations to conclude this isn’t pure luck, so this feature will probably be very important in models.
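The contact share and subscription rate per month can be recomputed from the counts in the table. This is a small base R sketch with illustrative column names, not the code used to build the report.

```r
# Counts copied from the month cross table above.
month_counts <- data.frame(
  month = c("mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"),
  no    = c(269, 2089, 12849, 4748, 6513, 5514, 314, 401, 3676, 93),
  yes   = c(274,  538,   884,  558,  647,  649, 256, 314,  415, 89),
  stringsAsFactors = FALSE
)

# Contacts per month, each month's share of all contacts,
# and the share of contacts that ended in a subscription.
month_counts$contacts       <- month_counts$no + month_counts$yes
month_counts$contact_share  <- round(month_counts$contacts / sum(month_counts$contacts), 3)
month_counts$subscribe_rate <- round(month_counts$yes / month_counts$contacts, 3)

# Sort from the lowest to the highest subscription rate.
month_counts[order(month_counts$subscribe_rate),
             c("month", "contacts", "contact_share", "subscribe_rate")]
```

Sorting by `subscribe_rate` puts the high-volume months at the top and the four low-volume months at the bottom, which is the pattern described above.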

Plot Analysis by month

i. Day of the Week

Customer Day of the week distribution profile

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  41090 
## 
##  
##              | y 
##  day_of_week |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##      (01)mon |      7650 |       844 |      8494 | 
##              |     0.901 |     0.099 |     0.207 | 
## -------------|-----------|-----------|-----------|
##      (02)tue |      7122 |       952 |      8074 | 
##              |     0.882 |     0.118 |     0.196 | 
## -------------|-----------|-----------|-----------|
##      (03)wed |      7174 |       944 |      8118 | 
##              |     0.884 |     0.116 |     0.198 | 
## -------------|-----------|-----------|-----------|
##      (04)thu |      7553 |      1040 |      8593 | 
##              |     0.879 |     0.121 |     0.209 | 
## -------------|-----------|-----------|-----------|
##      (05)fri |      6967 |       844 |      7811 | 
##              |     0.892 |     0.108 |     0.190 | 
## -------------|-----------|-----------|-----------|
## Column Total |     36466 |      4624 |     41090 | 
## -------------|-----------|-----------|-----------|
## 
## 

Calls aren’t made during weekend days. Although calls are evenly distributed across the weekdays, Thursdays tend to show better results (12.1% of calls made that day led to a subscription), unlike Mondays with only 9.9% of successful calls. However, those differences are small, which makes this feature not that important. It would’ve been interesting to see how responders react to weekend calls.
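To put a number on how small the weekday differences are, the success rates can be recomputed from the table and compared. The sketch below is illustrative only; the column names are assumptions.

```r
# Counts copied from the day-of-week cross table above.
dow <- data.frame(
  day = c("mon", "tue", "wed", "thu", "fri"),
  no  = c(7650, 7122, 7174, 7553, 6967),
  yes = c( 844,  952,  944, 1040,  844),
  stringsAsFactors = FALSE
)

# Share of calls on each day that ended in a subscription.
dow$rate <- dow$yes / (dow$no + dow$yes)

# Best and worst day, and the gap between them in percentage points.
best_day  <- dow$day[which.max(dow$rate)]
worst_day <- dow$day[which.min(dow$rate)]
spread_pp <- round(100 * (max(dow$rate) - min(dow$rate)), 1)

c(best = best_day, worst = worst_day)
spread_pp
```

The gap between the best and worst day is only about two percentage points, far smaller than the differences seen between months or contact types.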

Plot Analysis by Day of Week

5. Conclusion

From this exploratory analysis we have derived a lot of information from the data and visualized it to support a strategy in which the company can identify the customers who are more likely to subscribe and focus on those most likely to generate a sale.

age : 45.5% of people over 60 years old subscribed to a term deposit, which is a lot compared with younger individuals: 15.2% for young adults (aged under 30) and only 9.4% for the remaining observations (aged between 30 and 60).

jobs : Even though admin and blue-collar customers received the most calls, those most likely to subscribe are students and retired people.

Surprisingly, the student (31.4%), retired (25.2%) and unemployed (14.2%) categories show the best relative frequencies of term deposit subscription.

marital : Single customers subscribe slightly more often (14.0%) to term deposits than others (divorced (10.3%) and married (10.2%)).

education : There appears to be a positive correlation between the number of years of education and the odds of subscribing to a term deposit.

month : The highest spike in contacts occurs during May, but May also has the worst ratio of subscribers to persons contacted. Surprisingly, every month with a low frequency of contact (March, September, October and December) shows good results.

day of week : Thursday tends to show better results (12.1% of calls made that day led to a subscription).