1 Background

After the initial EDA and data understanding in here, I want to continue exploring the data for additional relationships and correlations between the columns in the data. This data was a part of machine learning prediction research aiming to predict based on the attributes of the customers of a certain bank, whether they will accept or reject term deposit offers through telemarketing campaign. Before I attempt to replicate the machine learning prediction, I will attempt to dig additional insights from the data.

2 Data Description

The dataset and its description can be found in this link.

There are four datasets in the link above:

  • bank.csv
  • bank-full.csv
  • bank-additional.csv
  • bank-additional-full.csv

Out of these four, I have decided to work with the bank.csv data as a represen tative of all the four dataset.

The description for each columns are:

  1. age (numeric)
  2. job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’ ,‘technician’,‘unemployed’,‘unknown’)
  3. marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
  4. education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
  5. default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
  6. housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
  7. loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
  8. contact: contact communication type (categorical: ‘cellular’,‘telephone’)
  9. month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
  10. day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
  11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. previous: number of contacts performed before this campaign and for this client (numeric)
  15. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
  16. y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)
  17. balance: not in the original description but is interpreted as each customers’ bank balance at the time of the current campaign

3 Data Preprocessing

# read in the bank.csv
bank <- read.csv("data_input/bank.csv",sep = ";")

# inspect the first six rows of bank and bank_full 
head(bank)
# inspect the data types of bank
str(bank)
#> 'data.frame':    4521 obs. of  17 variables:
#>  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
#>  $ job      : chr  "unemployed" "services" "management" "management" ...
#>  $ marital  : chr  "married" "married" "single" "married" ...
#>  $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
#>  $ default  : chr  "no" "no" "no" "no" ...
#>  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
#>  $ housing  : chr  "no" "yes" "yes" "yes" ...
#>  $ loan     : chr  "no" "yes" "no" "yes" ...
#>  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
#>  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
#>  $ month    : chr  "oct" "may" "apr" "jun" ...
#>  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
#>  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
#>  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
#>  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
#>  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
#>  $ y        : chr  "no" "no" "no" "no" ...
# check if there is any missing data on bank
anyNA(bank)
#> [1] FALSE
# convert data type char to factor
names <- c("job","marital","education","default","housing","loan","contact","month","poutcome","y")
bank[,names] <- lapply(bank[,names],FUN=as.factor)
str(bank)
#> 'data.frame':    4521 obs. of  17 variables:
#>  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
#>  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
#>  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
#>  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
#>  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
#>  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
#>  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
#>  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
#>  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
#>  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
#>  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
#>  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
#>  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
#>  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
#>  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
#>  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

4 Data Processing and Visualization

Now that I have preprocessed the data, lets create visualizations to gain additional insights on the data. The first insight I want to get is the relation between bank balance and jobs

# using median as indicator because of the extreme outliers of the balance data
balance_agg <- aggregate(formula = balance~job, data=bank, FUN= median)
ggplot(balance_agg, aes(x=balance, y= reorder(job, balance), fill = job))+
  theme_bw()+
  geom_col()+
  theme(legend.position = "none") +
  labs(title = "relation between jobs and the median of balance", 
       x = "balance in Euros",
       y = "job types")

The data shows that the highest median of bank balance belongs to the retired group while the lowest median belongs to the services group.

What about the maximum of bank balances according to job types ?

balance_agg <- aggregate(formula = balance~job, data=bank, FUN= max)
ggplot(balance_agg, aes(x=balance, y= reorder(job, balance), fill = job))+
  theme_bw()+
  geom_col()+
  theme(legend.position = "none") +
  labs(title = "relation between jobs and the maximum of balance", 
       x = "balance in Euros",
       y = "job types")

The maximum of balance also shows that the group that has the highest balance is the retired group, but this time the lowest maximum balance belong to the unknown group.

The distribution of jobs among the customers looks like this

jobstable <- as.data.frame(table(bank$job))

ggplot(jobstable, aes(x=Freq, y= reorder(Var1, Freq), fill = Var1))+
  theme_bw()+
  geom_col()+
  theme(legend.position = "none") +
  labs(title = "The Number of Customers Based on Their Jobs",
       x = "number of people",
       y = "job types")

As you can see, the three top most job categories are management, blue-collar, and technician with almost 1000 people on the first two and slightly more than 750 people for the third. The number of retired people in the dataset is slightly less than 250 people. Combining the insight from this chart with the previous two shows that although the retired group does not have the most number of people, they have the largest bank balance both by median and by maximum.

The next question to answer is in what job groups the largest number of people accept or reject the telemarketing campaign.

library(dplyr)
library(tidytext)
df1 <- as.data.frame(table(bank$job, bank$y))
df1 <- rename(df1, job = Var1, accept = Var2, num_of_people = Freq)
df1$job <- reorder_within(x = df1$job, by = df1$num_of_people, within = df1$accept)
ggplot(df1, aes(x=num_of_people ,y=reorder(job,num_of_people), fill = accept))+
  geom_col(position = "dodge", show.legend = F) + 
  scale_fill_manual(values=c("firebrick4", "blue")) +
  facet_wrap(~accept,scales="free") +
  scale_y_reordered() + 
  labs(y="jobs",
       title = "number of people who accepted/rejected offer based on their job") +
  theme_bw()

The chart shows that management has the highest number of people who accepted the telemarketing campaign while blue-collar has the highest number of people who rejected the telemarketing campaign.

The two final questions we will consider for now are:

  1. Is there a correlation between a customer’s age and her bank balance ?
  2. Do people above a certain age tend to accept telemarketing offer of term deposits ?
ggplot(bank, aes(x=age,y=balance))+
  geom_point()

From the above chart, there is no correlation between age and bank balance. This is surprising because I would expect that as one enters the age of workforce, one would begin to accumulate savings until the age of retirement (above 65) at which point the bank balance would begin to decrease.

ggplot(bank, aes(x=age,y=y))+
  geom_jitter()

From this chart, we can get the insight that there is no age separation between those who accepted and those who rejected the telemarketing offer. There are both acceptance and rejection of the telemarketing offer from customers of all ages in this dataset.

5 Conclusion

After our attempt at digging deeper for insights in this dataset, we can conclude that there is a slight preference for accepting term deposit offered via telemar keting in the management job group. There is no age group preference in accepting or rejecting telemarketing offer.
Two surprising observations are first, that people in the retirement group has the highest bank balance among all of job groups measured by median and by maximum. Second, there is no correlation between age and bank balance.