EDA - Bank Data Set

Introduction

The data set used is related with a direct marketing campaigns of a Portuguese banking institution and was taken from May 2008 to November 2010. It has 21 inputs which record the clients’ personal data, data related with the last contact of the current campaign, other attributes, social and economic context attributes and output variable. Further exploratory and detailed analysis of the data will be shown below.

Data Reading

The data used, “bank-additional-full.csv”, can be found in the same folder “LBB1” under the sub folder “data_input” Read the data by running the code below:

library(knitr)
bank <- read.csv("data_input/bank-additional-full.csv")
kable(head(bank))
age.job.marital.education.default.housing.loan.contact.month.day_of_week.duration.campaign.pdays.previous.poutcome.emp.var.rate.cons.price.idx.cons.conf.idx.euribor3m.nr.employed.y
56;housemaid;married;basic.4y;no;no;no;telephone;may;mon;261;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no
57;services;married;high.school;unknown;no;no;telephone;may;mon;149;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no
37;services;married;high.school;no;yes;no;telephone;may;mon;226;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no
40;admin.;married;basic.6y;no;no;no;telephone;may;mon;151;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no
56;services;married;high.school;no;no;yes;telephone;may;mon;307;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no
45;services;married;basic.9y;unknown;no;no;telephone;may;mon;198;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no

It is clear that the data is clumped into one column and needs to be separated into its separate columns.

Data Cleansing & Explicit Coercions

tidyverse will be used to separate the data into its respective columns.

library(tidyverse)
bank <- bank %>% 
  separate(col = age.job.marital.education.default.housing.loan.contact.month.day_of_week.duration.campaign.pdays.previous.poutcome.emp.var.rate.cons.price.idx.cons.conf.idx.euribor3m.nr.employed.y, into = c("age", "job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week",  "duration", "campaign", "pdays", "previous", "poutcome", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employeed", "y"), sep = ";")
kable(head(bank))
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employeed y
56 housemaid married basic.4y no no no telephone may mon 261 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
57 services married high.school unknown no no telephone may mon 149 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
37 services married high.school no yes no telephone may mon 226 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
40 admin. married basic.6y no no no telephone may mon 151 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
56 services married high.school no no yes telephone may mon 307 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
45 services married basic.9y unknown no no telephone may mon 198 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no

Now, we will convert each column into its correct data type.

bank[,c("age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employeed")] <- lapply(bank[,c("age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employeed")],as.numeric)

bank[,c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "poutcome", "y")] <- lapply(bank[,c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "poutcome", "y")],as.factor)

str(bank)
## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : num  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
##  $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
##  $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
##  $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
##  $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
##  $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
##  $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
##  $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration      : num  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : num  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employeed  : num  5191 5191 5191 5191 5191 ...
##  $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Once the data is in its correct data type, we will check for any missing value.

colSums(is.na(bank))
##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m   nr.employeed 
##              0              0              0              0              0 
##              y 
##              0
anyNA(bank)
## [1] FALSE

Since no data is missing, there will be no need for NA imputation nor deletion.

We will not be focusing on the social and economic context attributes: “emp.var.rate”, “cons.price.idx”, “cons.conf.idx”, “euribor3m”, and “nr.employeed”. Hence, we will drop those columns of data.

bank <- bank[,c(1:15,21)]

Data Inspection and Explanation

Now that there is no missing value and the data is already cleaned, we will be taking a peek into the data and check its size and the columns that we will be analyzing.

dim(bank)
## [1] 41188    16
names(bank)
##  [1] "age"         "job"         "marital"     "education"   "default"    
##  [6] "housing"     "loan"        "contact"     "month"       "day_of_week"
## [11] "duration"    "campaign"    "pdays"       "previous"    "poutcome"   
## [16] "y"

From the lines above, it is known that there are 41188 rows of data which have 16 columns of information.

summary(bank)
##       age                 job            marital     
##  Min.   :17.00   admin.     :10422   divorced: 4612  
##  1st Qu.:32.00   blue-collar: 9254   married :24928  
##  Median :38.00   technician : 6743   single  :11568  
##  Mean   :40.02   services   : 3969   unknown :   80  
##  3rd Qu.:47.00   management : 2924                   
##  Max.   :98.00   retired    : 1720                   
##                  (Other)    : 6156                   
##                education        default         housing           loan      
##  university.degree  :12168   no     :32588   no     :18622   no     :33950  
##  high.school        : 9515   unknown: 8597   unknown:  990   unknown:  990  
##  basic.9y           : 6045   yes    :    3   yes    :21576   yes    : 6248  
##  professional.course: 5243                                                  
##  basic.4y           : 4176                                                  
##  basic.6y           : 2292                                                  
##  (Other)            : 1749                                                  
##       contact          month       day_of_week    duration     
##  cellular :26144   may    :13769   fri:7827    Min.   :   0.0  
##  telephone:15044   jul    : 7174   mon:8514    1st Qu.: 102.0  
##                    aug    : 6178   thu:8623    Median : 180.0  
##                    jun    : 5318   tue:8090    Mean   : 258.3  
##                    nov    : 4101   wed:8134    3rd Qu.: 319.0  
##                    apr    : 2632               Max.   :4918.0  
##                    (Other): 2016                               
##     campaign          pdays          previous            poutcome    
##  Min.   : 1.000   Min.   :  0.0   Min.   :0.000   failure    : 4252  
##  1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   nonexistent:35563  
##  Median : 2.000   Median :999.0   Median :0.000   success    : 1373  
##  Mean   : 2.568   Mean   :962.5   Mean   :0.173                      
##  3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000                      
##  Max.   :56.000   Max.   :999.0   Max.   :7.000                      
##                                                                      
##    y        
##  no :36548  
##  yes: 4640  
##             
##             
##             
##             
## 

From the Summary above, various insights can be obtained:

  1. The average age of the bank clients is approximately 40 years old, with the youngest being 17 years old and the oldest being 98 years old.

  2. The most common type of job among the bank clients is admin work, while the least common job is a retiree.

  3. Among the 41188 bank clients, more than half of them are married.

  4. The bank clients have varying degree of education, but the most common degree of education is a university degree.

  5. Only three bank clients have bank default. On the other hand, over three quarters of the bank clients do not have credit in default while the rest are unknown.

  6. There are more bank clients that are contacted via cellular than those that are contacted via telephone.

  7. Most last contacts with the bank clients were done on May.

  8. The day of the week of the last contact with the bank clients are more or less evenly spread out from monday to friday, with thursday being the most common day.

  9. The average duration of the last contact with the bank clients is 258 seconds, with some of them not picking up and the longest call lasting for 4918 seconds.

  10. The average number of contacts with each and every bank clients are 2.568 times.

  11. Almost no client is contacted in the previous campaigns, with only an average of 0.173 contacts with the bank clients were made prior to this campaign.

  12. Most previous marketing campaign end up with no contact with the bank client. However, among the contacts that happen, many of the resulted in a failure.

  13. The outcome of this campaign is that majority of the bank clients are not subscribed to a term deposit.

Data Exploratory

Assumption

In this data exploratory, it will be assumed that the bank clients are not subscribed to the current term deposit regardless of whether or not they were subscribed to the previous term deposit.(indicated by “success” in the poutcome)

From the perspective of the Portuguese bank institution, the objective of its campaigns are to promote its term deposit and increase the number of clients that subscribe to it. However, are the campaigns really the major factor in the clients’ subscription in term deposit? If yes, then great. If not, why would the bank institution continue this campaign? This is an important point to consider as it deals with a major operational cost and high use of manpower. Unfortunately, due to the lack of detail in the data set, this hypothetical question may not be answered. As such, for the benefit of this exploratory, it is assumed that any subscription to the term deposit is due to the successful campaign and not because of any other reasons.

Now that the assumptions are set, we will be exploring two major components for a successful campaign which are clients and campaign approach which means that this exploratory may be divided into Client Targeting Exploratory and Campaign Approach Exploratory:

Client Targeting Exploratory

Targeting the right client is a crucial aspect as a prospective client will have a higher success rate. In this case, there will be several characteristics that will be analyzed separately to figure out the key characteristics of a prospective client. We will be exploring the clients’ age, job, loans, background education, and previous subscription and how it is related to the campaign success rate.

Is age relevant?

For this question, we will create a new column age_group which is divided into “17-30 years old”, “31-45 years old”, “46-65 years old”, and “more than 65 years old”.

library(dplyr)

bank <- bank %>% 
  mutate(agegroup = case_when(age >= 17  & age <= 30 ~ "17-30 years old",
                              age >= 31  & age <= 45 ~ "31-45 years old",
                              age >= 46  & age <= 65 ~ "46-65 years old",
                              age >= 66 ~ "more than 65 years old"
                              ))

Now that we have divided the data into its respective age group, we will be finding out the campaign success rate for each age group assuming that no other aspects interfere with the success rate.

round(prop.table(table(bank$agegroup, bank$y), margin = 1),4)*100
##                         
##                             no   yes
##   17-30 years old        84.78 15.22
##   31-45 years old        90.63  9.37
##   46-65 years old        89.58 10.42
##   more than 65 years old 53.15 46.85

From the table above, it is known that the least prospective clients have the age group of 31-45 years old with a success rate of lower than 10%. Clients with the age group of 46-65 years old have a similar campaign success rate of 10.42% while the age group of 17-30 years old have a success rate of 15.22%. On the other hand, clients that are more than 65 years old has a very high success rate relative to other age groups, having a success rate of close 46.85% which is almost half.

As shown above, the extreme disparity in success rate due to the difference in age group is apparent for clients that are more than 65 years old. However, this result is relatively unstable as the data size of clients that are more than 65 years old are less than 10% of the data size of the other age groups as shown in the line below. Hence, although this exploratory yields a result that encourage campaigns to clients that are more than 65 years old, it is important to note that it may not always be the case and further data collection may be required.

table(bank$agegroup)
## 
##        17-30 years old        31-45 years old        46-65 years old 
##                   7383                  21974                  11212 
## more than 65 years old 
##                    619

Are the clients’ jobs relevant?

Similar to the previous exploratory, we will be comparing the success rate with respect to the clients’ jobs

round(prop.table(table(bank$job, bank$y), margin = 1),4)*100
##                
##                    no   yes
##   admin.        87.03 12.97
##   blue-collar   93.11  6.89
##   entrepreneur  91.48  8.52
##   housemaid     90.00 10.00
##   management    88.78 11.22
##   retired       74.77 25.23
##   self-employed 89.51 10.49
##   services      91.86  8.14
##   student       68.57 31.43
##   technician    89.17 10.83
##   unemployed    85.80 14.20
##   unknown       88.79 11.21

Based on the results obtained above, the job that has the highest success rate is student with 31.43% and is closely followed by retirees with 25.23%. This is in line with the age group exploratory where clients that are considered senior citizens have an unusually high success rate which may be due to the nature of the term deposit. The least prospective customers have the jobs blue-collar, services, and entrepreneurs with success rates of 6.89%, 8.14%, and 8.52% respectively.

However, there is a questionable insight in which the highest campaign success rate belongs to the students as they are still young and is likely that they do not even have their own personal income yet. Thus, we need to check the data size of the students.

table(bank$job)
## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##         10422          9254          1456          1060          2924 
##       retired self-employed      services       student    technician 
##          1720          1421          3969           875          6743 
##    unemployed       unknown 
##          1014           330

As expected, with a data size of 875 that is relatively less than the other jobs, the insight should be treated with much consideration as there is a high chance for fluctuations and that it may not reflect the same results in other campaigns. Despite that, due to huge gap in high success rate, campaigns done to students and retirees are highly encouraged.

Are background education relevant?

For this case, we will be exploring the success rate based on the clients’ background education

round(prop.table(table(bank$education, bank$y), margin = 1),4)*100
##                      
##                          no   yes
##   basic.4y            89.75 10.25
##   basic.6y            91.80  8.20
##   basic.9y            92.18  7.82
##   high.school         89.16 10.84
##   illiterate          77.78 22.22
##   professional.course 88.65 11.35
##   university.degree   86.28 13.72
##   unknown             85.50 14.50

It is clear that illiterate clients has the highest success rate of 22.22% followed by clients with a university degree that has a success rate of 13.72%. We will ignore the data under the unknown background education for the benefit of this analysis. Although we gained such insight, there is a false information as there are only 18 illiterate clients which result in the highly fluctuating result and explains the unusually high success rate. In this case, we will also ignore such insight and encourage campaigns to be done to clients with a university degree due to its high relative success rate and its high reliability.

table(bank$education)
## 
##            basic.4y            basic.6y            basic.9y         high.school 
##                4176                2292                6045                9515 
##          illiterate professional.course   university.degree             unknown 
##                  18                5243               12168                1731

Does previous subscription matter?

This question will be exploring the successful previous contact with the client and its effects to the outcome of the current campaign relative to a failed or nonexistent previous contact.

round(prop.table(table(bank$poutcome, bank$y), margin = 1),4)*100
##              
##                  no   yes
##   failure     85.77 14.23
##   nonexistent 91.17  8.83
##   success     34.89 65.11

This is a highly positive insight as the success rate of clients that experience successful previous contact is 65.11% which is in contrast with the success rate of a failed or nonexistent previous contact. Moreover, after checking the data size and discovering that the data size is more than 3% of the whole population, this insight is more reliable than the other insights so far. Hence, it is highly encouraged to focus the campaign to clients that previously experience successful contact.

table(bank$poutcome)
## 
##     failure nonexistent     success 
##        4252       35563        1373

Campaign Approach Exploratory

Unlike the Client Targeting Exploratory, the Campaign Approach Exploratory aims to determine which approach is the most effective in obtaining a term deposit subscription. In this exploratory, we will be focusing on the number of campaign and the contact communication type.

Is the number of contacts relevant?

bank <- bank %>% 
  mutate(contact_range = case_when(
                              campaign == 1 ~ "1",
                              campaign >= 2 & campaign <= 3 ~ "2-3",
                              campaign >= 4 & campaign <= 5 ~ "4-5",
                              campaign >= 6 & campaign <= 10 ~ "6-10",
                              campaign > 10 ~ "More than 10",
                              ))

After separating the number of contacts into its own range, we will explore the effects of the number of contacts to the success rate of the campaign.

round(prop.table(table(bank$contact_range, bank$y), margin = 1),4)*100
##               
##                   no   yes
##   1            86.96 13.04
##   2-3          88.78 11.22
##   4-5          91.32  8.68
##   6-10         93.68  6.32
##   More than 10 96.89  3.11
table(bank$contact_range)
## 
##            1          2-3          4-5         6-10 More than 10 
##        17642        15911         4250         2516          869

In this instance, the number of contact and the campaign success rate has a negative correlation with one contact resulting in the highest success rate of 13.04%.

Does the contact communication type matter?

round(prop.table(table(bank$contact, bank$y), margin = 1),4)*100
##            
##                no   yes
##   cellular  85.26 14.74
##   telephone 94.77  5.23
table(bank$contact)
## 
##  cellular telephone 
##     26144     15044

From this table, both cellular and telephone has a large dataset of 26144 and 15044 which shows that is has a high degree of reliable information. However, there is a huge gap in success rate when categorised based on the contact communication type. Telephone has a very low success rate of 5.23% while cellular has a 14.74% success rate despite it having more data size. Hence it is known that campaign via cellular is more effective than its counterpart.

Conclusion

In conclusion, the Portuguese bank needs to target the right clients to enjoy a high success rate. This includes picking clients with the right age, job, background education, and most importantly, clients that has a successful previous contact. On top of that, the right approach is also needed. Ideally, one contact is made with a cellular to yield the highest success rate.

However, despite all the insights above, all the aspects are related to each other and it is often not possible to only consider one or two aspects separately. Moreover, there are several limitations that serve as a hindrance to the analysis. Thus several recommendations are listed below.

Recommendation

  1. As mentioned before, the hypothetical question of whether or not the successful campaign causes the subscription of the term deposit needs to be answered. It is necessary to consider additional details in future data set such as the duration between the last contact and the date when the client subscribe to the term deposit. From then on, it can be assumed that the subscriptions with any duration that is more than 14 days are not caused by the campaign itself. With such detail, further insights may be obtained.

  2. The clients’ subscription to the term deposit may not always be because of the attractive features of the term deposit itself. It could be because of the employee’s amazing promoting skills, the relationship between the client and the employee, or many others. Hence, it is also important to include the details of the employee that conducted each contact during the campaign to accurately obtain an insight and take action accordingly.

  3. Larger dataset is required to obtain a more reliable insight as some of the insights obtained above may not be a correct representation of other campaigns.

Reference

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014