The data set used is related with a direct marketing campaigns of a Portuguese banking institution and was taken from May 2008 to November 2010. It has 21 inputs which record the clients’ personal data, data related with the last contact of the current campaign, other attributes, social and economic context attributes and output variable. Further exploratory and detailed analysis of the data will be shown below.
The data used, “bank-additional-full.csv”, can be found in the same folder “LBB1” under the sub folder “data_input” Read the data by running the code below:
library(knitr)
bank <- read.csv("data_input/bank-additional-full.csv")
kable(head(bank))
| age.job.marital.education.default.housing.loan.contact.month.day_of_week.duration.campaign.pdays.previous.poutcome.emp.var.rate.cons.price.idx.cons.conf.idx.euribor3m.nr.employed.y |
|---|
| 56;housemaid;married;basic.4y;no;no;no;telephone;may;mon;261;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no |
| 57;services;married;high.school;unknown;no;no;telephone;may;mon;149;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no |
| 37;services;married;high.school;no;yes;no;telephone;may;mon;226;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no |
| 40;admin.;married;basic.6y;no;no;no;telephone;may;mon;151;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no |
| 56;services;married;high.school;no;no;yes;telephone;may;mon;307;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no |
| 45;services;married;basic.9y;unknown;no;no;telephone;may;mon;198;1;999;0;nonexistent;1.1;93.994;-36.4;4.857;5191;no |
It is clear that the data is clumped into one column and needs to be separated into its separate columns.
tidyverse will be used to separate the data into its respective columns.
library(tidyverse)
bank <- bank %>%
separate(col = age.job.marital.education.default.housing.loan.contact.month.day_of_week.duration.campaign.pdays.previous.poutcome.emp.var.rate.cons.price.idx.cons.conf.idx.euribor3m.nr.employed.y, into = c("age", "job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "duration", "campaign", "pdays", "previous", "poutcome", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employeed", "y"), sep = ";")
kable(head(bank))
| age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employeed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | 261 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 57 | services | married | high.school | unknown | no | no | telephone | may | mon | 149 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 37 | services | married | high.school | no | yes | no | telephone | may | mon | 226 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | 151 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 56 | services | married | high.school | no | no | yes | telephone | may | mon | 307 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 45 | services | married | basic.9y | unknown | no | no | telephone | may | mon | 198 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
Now, we will convert each column into its correct data type.
bank[,c("age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employeed")] <- lapply(bank[,c("age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employeed")],as.numeric)
bank[,c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "poutcome", "y")] <- lapply(bank[,c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "poutcome", "y")],as.factor)
str(bank)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : num 56 57 37 40 56 45 59 41 24 25 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : num 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : num 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employeed : num 5191 5191 5191 5191 5191 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Once the data is in its correct data type, we will check for any missing value.
colSums(is.na(bank))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employeed
## 0 0 0 0 0
## y
## 0
anyNA(bank)
## [1] FALSE
Since no data is missing, there will be no need for NA imputation nor deletion.
We will not be focusing on the social and economic context attributes: “emp.var.rate”, “cons.price.idx”, “cons.conf.idx”, “euribor3m”, and “nr.employeed”. Hence, we will drop those columns of data.
bank <- bank[,c(1:15,21)]
Now that there is no missing value and the data is already cleaned, we will be taking a peek into the data and check its size and the columns that we will be analyzing.
dim(bank)
## [1] 41188 16
names(bank)
## [1] "age" "job" "marital" "education" "default"
## [6] "housing" "loan" "contact" "month" "day_of_week"
## [11] "duration" "campaign" "pdays" "previous" "poutcome"
## [16] "y"
From the lines above, it is known that there are 41188 rows of data which have 16 columns of information.
summary(bank)
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing loan
## university.degree :12168 no :32588 no :18622 no :33950
## high.school : 9515 unknown: 8597 unknown: 990 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576 yes : 6248
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## contact month day_of_week duration
## cellular :26144 may :13769 fri:7827 Min. : 0.0
## telephone:15044 jul : 7174 mon:8514 1st Qu.: 102.0
## aug : 6178 thu:8623 Median : 180.0
## jun : 5318 tue:8090 Mean : 258.3
## nov : 4101 wed:8134 3rd Qu.: 319.0
## apr : 2632 Max. :4918.0
## (Other): 2016
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35563
## Median : 2.000 Median :999.0 Median :0.000 success : 1373
## Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :56.000 Max. :999.0 Max. :7.000
##
## y
## no :36548
## yes: 4640
##
##
##
##
##
From the Summary above, various insights can be obtained:
The average age of the bank clients is approximately 40 years old, with the youngest being 17 years old and the oldest being 98 years old.
The most common type of job among the bank clients is admin work, while the least common job is a retiree.
Among the 41188 bank clients, more than half of them are married.
The bank clients have varying degree of education, but the most common degree of education is a university degree.
Only three bank clients have bank default. On the other hand, over three quarters of the bank clients do not have credit in default while the rest are unknown.
There are more bank clients that are contacted via cellular than those that are contacted via telephone.
Most last contacts with the bank clients were done on May.
The day of the week of the last contact with the bank clients are more or less evenly spread out from monday to friday, with thursday being the most common day.
The average duration of the last contact with the bank clients is 258 seconds, with some of them not picking up and the longest call lasting for 4918 seconds.
The average number of contacts with each and every bank clients are 2.568 times.
Almost no client is contacted in the previous campaigns, with only an average of 0.173 contacts with the bank clients were made prior to this campaign.
Most previous marketing campaign end up with no contact with the bank client. However, among the contacts that happen, many of the resulted in a failure.
The outcome of this campaign is that majority of the bank clients are not subscribed to a term deposit.
In this data exploratory, it will be assumed that the bank clients are not subscribed to the current term deposit regardless of whether or not they were subscribed to the previous term deposit.(indicated by “success” in the poutcome)
From the perspective of the Portuguese bank institution, the objective of its campaigns are to promote its term deposit and increase the number of clients that subscribe to it. However, are the campaigns really the major factor in the clients’ subscription in term deposit? If yes, then great. If not, why would the bank institution continue this campaign? This is an important point to consider as it deals with a major operational cost and high use of manpower. Unfortunately, due to the lack of detail in the data set, this hypothetical question may not be answered. As such, for the benefit of this exploratory, it is assumed that any subscription to the term deposit is due to the successful campaign and not because of any other reasons.
Now that the assumptions are set, we will be exploring two major components for a successful campaign which are clients and campaign approach which means that this exploratory may be divided into Client Targeting Exploratory and Campaign Approach Exploratory:
Targeting the right client is a crucial aspect as a prospective client will have a higher success rate. In this case, there will be several characteristics that will be analyzed separately to figure out the key characteristics of a prospective client. We will be exploring the clients’ age, job, loans, background education, and previous subscription and how it is related to the campaign success rate.
Is age relevant?
For this question, we will create a new column age_group which is divided into “17-30 years old”, “31-45 years old”, “46-65 years old”, and “more than 65 years old”.
library(dplyr)
bank <- bank %>%
mutate(agegroup = case_when(age >= 17 & age <= 30 ~ "17-30 years old",
age >= 31 & age <= 45 ~ "31-45 years old",
age >= 46 & age <= 65 ~ "46-65 years old",
age >= 66 ~ "more than 65 years old"
))
Now that we have divided the data into its respective age group, we will be finding out the campaign success rate for each age group assuming that no other aspects interfere with the success rate.
round(prop.table(table(bank$agegroup, bank$y), margin = 1),4)*100
##
## no yes
## 17-30 years old 84.78 15.22
## 31-45 years old 90.63 9.37
## 46-65 years old 89.58 10.42
## more than 65 years old 53.15 46.85
From the table above, it is known that the least prospective clients have the age group of 31-45 years old with a success rate of lower than 10%. Clients with the age group of 46-65 years old have a similar campaign success rate of 10.42% while the age group of 17-30 years old have a success rate of 15.22%. On the other hand, clients that are more than 65 years old has a very high success rate relative to other age groups, having a success rate of close 46.85% which is almost half.
As shown above, the extreme disparity in success rate due to the difference in age group is apparent for clients that are more than 65 years old. However, this result is relatively unstable as the data size of clients that are more than 65 years old are less than 10% of the data size of the other age groups as shown in the line below. Hence, although this exploratory yields a result that encourage campaigns to clients that are more than 65 years old, it is important to note that it may not always be the case and further data collection may be required.
table(bank$agegroup)
##
## 17-30 years old 31-45 years old 46-65 years old
## 7383 21974 11212
## more than 65 years old
## 619
Are the clients’ jobs relevant?
Similar to the previous exploratory, we will be comparing the success rate with respect to the clients’ jobs
round(prop.table(table(bank$job, bank$y), margin = 1),4)*100
##
## no yes
## admin. 87.03 12.97
## blue-collar 93.11 6.89
## entrepreneur 91.48 8.52
## housemaid 90.00 10.00
## management 88.78 11.22
## retired 74.77 25.23
## self-employed 89.51 10.49
## services 91.86 8.14
## student 68.57 31.43
## technician 89.17 10.83
## unemployed 85.80 14.20
## unknown 88.79 11.21
Based on the results obtained above, the job that has the highest success rate is student with 31.43% and is closely followed by retirees with 25.23%. This is in line with the age group exploratory where clients that are considered senior citizens have an unusually high success rate which may be due to the nature of the term deposit. The least prospective customers have the jobs blue-collar, services, and entrepreneurs with success rates of 6.89%, 8.14%, and 8.52% respectively.
However, there is a questionable insight in which the highest campaign success rate belongs to the students as they are still young and is likely that they do not even have their own personal income yet. Thus, we need to check the data size of the students.
table(bank$job)
##
## admin. blue-collar entrepreneur housemaid management
## 10422 9254 1456 1060 2924
## retired self-employed services student technician
## 1720 1421 3969 875 6743
## unemployed unknown
## 1014 330
As expected, with a data size of 875 that is relatively less than the other jobs, the insight should be treated with much consideration as there is a high chance for fluctuations and that it may not reflect the same results in other campaigns. Despite that, due to huge gap in high success rate, campaigns done to students and retirees are highly encouraged.
Are background education relevant?
For this case, we will be exploring the success rate based on the clients’ background education
round(prop.table(table(bank$education, bank$y), margin = 1),4)*100
##
## no yes
## basic.4y 89.75 10.25
## basic.6y 91.80 8.20
## basic.9y 92.18 7.82
## high.school 89.16 10.84
## illiterate 77.78 22.22
## professional.course 88.65 11.35
## university.degree 86.28 13.72
## unknown 85.50 14.50
It is clear that illiterate clients has the highest success rate of 22.22% followed by clients with a university degree that has a success rate of 13.72%. We will ignore the data under the unknown background education for the benefit of this analysis. Although we gained such insight, there is a false information as there are only 18 illiterate clients which result in the highly fluctuating result and explains the unusually high success rate. In this case, we will also ignore such insight and encourage campaigns to be done to clients with a university degree due to its high relative success rate and its high reliability.
table(bank$education)
##
## basic.4y basic.6y basic.9y high.school
## 4176 2292 6045 9515
## illiterate professional.course university.degree unknown
## 18 5243 12168 1731
Does previous subscription matter?
This question will be exploring the successful previous contact with the client and its effects to the outcome of the current campaign relative to a failed or nonexistent previous contact.
round(prop.table(table(bank$poutcome, bank$y), margin = 1),4)*100
##
## no yes
## failure 85.77 14.23
## nonexistent 91.17 8.83
## success 34.89 65.11
This is a highly positive insight as the success rate of clients that experience successful previous contact is 65.11% which is in contrast with the success rate of a failed or nonexistent previous contact. Moreover, after checking the data size and discovering that the data size is more than 3% of the whole population, this insight is more reliable than the other insights so far. Hence, it is highly encouraged to focus the campaign to clients that previously experience successful contact.
table(bank$poutcome)
##
## failure nonexistent success
## 4252 35563 1373
Unlike the Client Targeting Exploratory, the Campaign Approach Exploratory aims to determine which approach is the most effective in obtaining a term deposit subscription. In this exploratory, we will be focusing on the number of campaign and the contact communication type.
Is the number of contacts relevant?
bank <- bank %>%
mutate(contact_range = case_when(
campaign == 1 ~ "1",
campaign >= 2 & campaign <= 3 ~ "2-3",
campaign >= 4 & campaign <= 5 ~ "4-5",
campaign >= 6 & campaign <= 10 ~ "6-10",
campaign > 10 ~ "More than 10",
))
After separating the number of contacts into its own range, we will explore the effects of the number of contacts to the success rate of the campaign.
round(prop.table(table(bank$contact_range, bank$y), margin = 1),4)*100
##
## no yes
## 1 86.96 13.04
## 2-3 88.78 11.22
## 4-5 91.32 8.68
## 6-10 93.68 6.32
## More than 10 96.89 3.11
table(bank$contact_range)
##
## 1 2-3 4-5 6-10 More than 10
## 17642 15911 4250 2516 869
In this instance, the number of contact and the campaign success rate has a negative correlation with one contact resulting in the highest success rate of 13.04%.
Does the contact communication type matter?
round(prop.table(table(bank$contact, bank$y), margin = 1),4)*100
##
## no yes
## cellular 85.26 14.74
## telephone 94.77 5.23
table(bank$contact)
##
## cellular telephone
## 26144 15044
From this table, both cellular and telephone has a large dataset of 26144 and 15044 which shows that is has a high degree of reliable information. However, there is a huge gap in success rate when categorised based on the contact communication type. Telephone has a very low success rate of 5.23% while cellular has a 14.74% success rate despite it having more data size. Hence it is known that campaign via cellular is more effective than its counterpart.
In conclusion, the Portuguese bank needs to target the right clients to enjoy a high success rate. This includes picking clients with the right age, job, background education, and most importantly, clients that has a successful previous contact. On top of that, the right approach is also needed. Ideally, one contact is made with a cellular to yield the highest success rate.
However, despite all the insights above, all the aspects are related to each other and it is often not possible to only consider one or two aspects separately. Moreover, there are several limitations that serve as a hindrance to the analysis. Thus several recommendations are listed below.
As mentioned before, the hypothetical question of whether or not the successful campaign causes the subscription of the term deposit needs to be answered. It is necessary to consider additional details in future data set such as the duration between the last contact and the date when the client subscribe to the term deposit. From then on, it can be assumed that the subscriptions with any duration that is more than 14 days are not caused by the campaign itself. With such detail, further insights may be obtained.
The clients’ subscription to the term deposit may not always be because of the attractive features of the term deposit itself. It could be because of the employee’s amazing promoting skills, the relationship between the client and the employee, or many others. Hence, it is also important to include the details of the employee that conducted each contact during the campaign to accurately obtain an insight and take action accordingly.
Larger dataset is required to obtain a more reliable insight as some of the insights obtained above may not be a correct representation of other campaigns.
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014