After the initial EDA and data understanding in here, I want to continue exploring the data for additional relationships and correlations between the columns in the data. This data was a part of machine learning prediction research aiming to predict based on the attributes of the customers of a certain bank, whether they will accept or reject term deposit offers through telemarketing campaign. Before I attempt to replicate the machine learning prediction, I will attempt to dig additional insights from the data.
The dataset and its description can be found in this link.
There are four datasets in the link above:
Out of these four, I have decided to work with the bank.csv data as a represen tative of all the four dataset.
The description for each columns are:
# read in the bank.csv
bank <- read.csv("data_input/bank.csv",sep = ";")
# inspect the first six rows of bank and bank_full
head(bank)
# inspect the data types of bank
str(bank)
#> 'data.frame': 4521 obs. of 17 variables:
#> $ age : int 30 33 35 30 59 35 36 39 41 43 ...
#> $ job : chr "unemployed" "services" "management" "management" ...
#> $ marital : chr "married" "married" "single" "married" ...
#> $ education: chr "primary" "secondary" "tertiary" "tertiary" ...
#> $ default : chr "no" "no" "no" "no" ...
#> $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
#> $ housing : chr "no" "yes" "yes" "yes" ...
#> $ loan : chr "no" "yes" "no" "yes" ...
#> $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
#> $ day : int 19 11 16 3 5 23 14 6 14 17 ...
#> $ month : chr "oct" "may" "apr" "jun" ...
#> $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
#> $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
#> $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
#> $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
#> $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
#> $ y : chr "no" "no" "no" "no" ...
# check if there is any missing data on bank
anyNA(bank)
#> [1] FALSE
# convert data type char to factor
names <- c("job","marital","education","default","housing","loan","contact","month","poutcome","y")
bank[,names] <- lapply(bank[,names],FUN=as.factor)
str(bank)
#> 'data.frame': 4521 obs. of 17 variables:
#> $ age : int 30 33 35 30 59 35 36 39 41 43 ...
#> $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
#> $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
#> $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
#> $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
#> $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
#> $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
#> $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
#> $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
#> $ day : int 19 11 16 3 5 23 14 6 14 17 ...
#> $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
#> $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
#> $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
#> $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
#> $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
#> $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
#> $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Now that I have preprocessed the data, lets create visualizations to gain additional insights on the data. The first insight I want to get is the relation between bank balance and jobs
# using median as indicator because of the extreme outliers of the balance data
balance_agg <- aggregate(formula = balance~job, data=bank, FUN= median)
ggplot(balance_agg, aes(x=balance, y= reorder(job, balance), fill = job))+
theme_bw()+
geom_col()+
theme(legend.position = "none") +
labs(title = "relation between jobs and the median of balance",
x = "balance in Euros",
y = "job types")
The data shows that the highest median of bank balance belongs to the retired group while the lowest median belongs to the services group.
What about the maximum of bank balances according to job types ?
balance_agg <- aggregate(formula = balance~job, data=bank, FUN= max)
ggplot(balance_agg, aes(x=balance, y= reorder(job, balance), fill = job))+
theme_bw()+
geom_col()+
theme(legend.position = "none") +
labs(title = "relation between jobs and the maximum of balance",
x = "balance in Euros",
y = "job types")
The maximum of balance also shows that the group that has the highest balance is the retired group, but this time the lowest maximum balance belong to the unknown group.
The distribution of jobs among the customers looks like this
jobstable <- as.data.frame(table(bank$job))
ggplot(jobstable, aes(x=Freq, y= reorder(Var1, Freq), fill = Var1))+
theme_bw()+
geom_col()+
theme(legend.position = "none") +
labs(title = "The Number of Customers Based on Their Jobs",
x = "number of people",
y = "job types")
As you can see, the three top most job categories are management, blue-collar, and technician with almost 1000 people on the first two and slightly more than 750 people for the third. The number of retired people in the dataset is slightly less than 250 people. Combining the insight from this chart with the previous two shows that although the retired group does not have the most number of people, they have the largest bank balance both by median and by maximum.
The next question to answer is in what job groups the largest number of people accept or reject the telemarketing campaign.
library(dplyr)
library(tidytext)
df1 <- as.data.frame(table(bank$job, bank$y))
df1 <- rename(df1, job = Var1, accept = Var2, num_of_people = Freq)
df1$job <- reorder_within(x = df1$job, by = df1$num_of_people, within = df1$accept)
ggplot(df1, aes(x=num_of_people ,y=reorder(job,num_of_people), fill = accept))+
geom_col(position = "dodge", show.legend = F) +
scale_fill_manual(values=c("firebrick4", "blue")) +
facet_wrap(~accept,scales="free") +
scale_y_reordered() +
labs(y="jobs",
title = "number of people who accepted/rejected offer based on their job") +
theme_bw()
The chart shows that management has the highest number of people who accepted the telemarketing campaign while blue-collar has the highest number of people who rejected the telemarketing campaign.
The two final questions we will consider for now are:
ggplot(bank, aes(x=age,y=balance))+
geom_point()
From the above chart, there is no correlation between age and bank balance. This is surprising because I would expect that as one enters the age of workforce, one would begin to accumulate savings until the age of retirement (above 65) at which point the bank balance would begin to decrease.
ggplot(bank, aes(x=age,y=y))+
geom_jitter()
From this chart, we can get the insight that there is no age separation between those who accepted and those who rejected the telemarketing offer. There are both acceptance and rejection of the telemarketing offer from customers of all ages in this dataset.
After our attempt at digging deeper for insights in this dataset, we can conclude that there is a slight preference for accepting term deposit offered via telemar keting in the management job group. There is no age group preference in accepting or rejecting telemarketing offer.
Two surprising observations are first, that people in the retirement group has the highest bank balance among all of job groups measured by median and by maximum. Second, there is no correlation between age and bank balance.