The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.
library(dplyr)The package “dplyr” comprises many functions that perform mostly used data manipulation operations such as applying filter, selecting specific columns, sorting data, adding or deleting columns and aggregating data. Another most important advantage of this package is that it’s very easy to learn and use dplyr functions.
data = read.csv("E:/Algoritma/1_lbb/bank.csv")
head(data)#> age job marital education default balance housing loan contact day
#> 1 59 admin. married secondary no 2343 yes no unknown 5
#> 2 56 admin. married secondary no 45 no no unknown 5
#> 3 41 technician married secondary no 1270 yes no unknown 5
#> 4 55 services married secondary no 2476 yes no unknown 5
#> 5 54 admin. married tertiary no 184 no no unknown 5
#> 6 42 management single tertiary no 0 yes yes unknown 5
#> month duration campaign pdays previous poutcome deposit
#> 1 may 1042 1 -1 0 unknown yes
#> 2 may 1467 1 -1 0 unknown yes
#> 3 may 1389 1 -1 0 unknown yes
#> 4 may 579 1 -1 0 unknown yes
#> 5 may 673 2 -1 0 unknown yes
#> 6 may 562 2 -1 0 unknown yes
the table above is a sample of the top 6 data from Bank Marketing Dataset
data_clean <- data %>%
select (-c(default, contact, day, month)) %>%
mutate(
housing = ifelse(housing=="yes",1,0),
loan = ifelse(loan=="yes",1,0),
deposit = ifelse(deposit=="yes",1,0)) %>%
mutate (
job = as.factor(job),
marital = as.factor(marital),
education = as.factor (education),
housing = as.factor(housing),
loan = as.factor(loan),
poutcome = as.factor(poutcome),
deposit = as.integer(deposit)
)
str(data_clean)#> 'data.frame': 11162 obs. of 13 variables:
#> $ age : int 59 56 41 55 54 42 56 60 37 28 ...
#> $ job : Factor w/ 12 levels "admin.","blue-collar",..: 1 1 10 8 1 5 5 6 10 8 ...
#> $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 2 1 2 3 ...
#> $ education: Factor w/ 4 levels "primary","secondary",..: 2 2 2 2 3 3 3 2 2 2 ...
#> $ balance : int 2343 45 1270 2476 184 0 830 545 1 5090 ...
#> $ housing : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 2 ...
#> $ loan : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 1 1 1 ...
#> $ duration : int 1042 1467 1389 579 673 562 1201 1030 608 1297 ...
#> $ campaign : int 1 1 1 1 2 2 1 1 1 3 ...
#> $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#> $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
#> $ deposit : int 1 1 1 1 1 1 1 1 1 1 ...
Describe about dataset :
age : age of the customerjob : type of jobmarital : marital statuseducation : education of the customerhousing : customer has housing loan (categorical:
‘no’,‘yes’)loan : customer has personal loan (categorical:
‘no’,‘yes’)duration: last contact duration, in seconds
(numeric).campaign: number of contacts performed during this
campaign and for this clientpdays: number of days that passed by after the client
was last contacted from a previous campaignprevious : number of contacts performed before this
campaign and for this client (numeric)poutcome: outcome of the previous marketing
campaigndeposite: has the client subscribed a term deposit
(binary: ‘yes’,‘no’)data_clean <- data_clean[data$balance >= 0,]
head(data_clean)#> age job marital education balance housing loan duration campaign pdays
#> 1 59 admin. married secondary 2343 1 0 1042 1 -1
#> 2 56 admin. married secondary 45 0 0 1467 1 -1
#> 3 41 technician married secondary 1270 1 0 1389 1 -1
#> 4 55 services married secondary 2476 1 0 579 1 -1
#> 5 54 admin. married tertiary 184 0 0 673 2 -1
#> 6 42 management single tertiary 0 1 1 562 2 -1
#> previous poutcome deposit
#> 1 0 unknown 1
#> 2 0 unknown 1
#> 3 0 unknown 1
#> 4 0 unknown 1
#> 5 0 unknown 1
#> 6 0 unknown 1
filtering is used to get customers who have a balance of more than 0
colSums(is.na(data_clean))#> age job marital education balance housing loan duration
#> 0 0 0 0 0 0 0 0
#> campaign pdays previous poutcome deposit
#> 0 0 0 0 0
the dataset that we have does not have a missing value so that further analysis can be carried out and no missing value handling is required
dim(data_clean)#> [1] 10474 13
the data used is 10,474 rows with 13 variables
summary(data_clean)#> age job marital education
#> Min. :18.00 management :2457 divorced:1177 primary :1373
#> 1st Qu.:32.00 blue-collar:1733 married :5966 secondary:5081
#> Median :39.00 technician :1714 single :3331 tertiary :3539
#> Mean :41.34 admin. :1245 unknown : 481
#> 3rd Qu.:49.00 services : 844
#> Max. :95.00 retired : 755
#> (Other) :1726
#> balance housing loan duration campaign
#> Min. : 0.0 0:5702 0:9233 Min. : 2.0 Min. : 1.000
#> 1st Qu.: 183.2 1:4772 1:1241 1st Qu.: 139.0 1st Qu.: 1.000
#> Median : 625.0 Median : 256.0 Median : 2.000
#> Mean : 1650.2 Mean : 371.1 Mean : 2.493
#> 3rd Qu.: 1840.8 3rd Qu.: 492.8 3rd Qu.: 3.000
#> Max. :81204.0 Max. :3881.0 Max. :63.000
#>
#> pdays previous poutcome deposit
#> Min. : -1.00 Min. : 0.0000 failure:1169 Min. :0.0000
#> 1st Qu.: -1.00 1st Qu.: 0.0000 other : 519 1st Qu.:0.0000
#> Median : -1.00 Median : 0.0000 success:1062 Median :0.0000
#> Mean : 52.42 Mean : 0.8572 unknown:7724 Mean :0.4849
#> 3rd Qu.: 71.75 3rd Qu.: 1.0000 3rd Qu.:1.0000
#> Max. :854.00 Max. :58.0000 Max. :1.0000
#>
The output above is a breakdown of the size of the data distribution
sort(table(data_clean$education), decreasing = T)#>
#> secondary tertiary primary unknown
#> 5081 3539 1373 481
It can be seen that customers with secondary education are at most 5,081 people, and followed by tertiaries of 3,539 people with 481 customers whose education is unknown
sort(table(data_clean$job), decreasing = T)#>
#> management blue-collar technician admin. services
#> 2457 1733 1714 1245 844
#> retired self-employed student unemployed entrepreneur
#> 755 387 360 343 301
#> housemaid unknown
#> 265 70
then 2,457 customers work in management, and followed by blue-collar of 1,733 customers. In third place there is a technician of 1,714 customers with 70 customers unknown job
sort(table(data_clean$deposit), decreasing = T)#>
#> 0 1
#> 5395 5079
bank noted that there were 5,079 customers who subscribed to deposits after carrying out promotions by telephone for customers who have a balance of more than 0
sd(data_clean$balance)#> [1] 3291.662
if a customer has a balance above 1,650.2 + 3,291.662 it can be said that the customer belongs to the upper middle class economy. conversely, if a customer has a balance below 1,650.2 - 3,291.662 then the customer is classified as a lower middle class economy
cor(data_clean$balance, data_clean$deposit)#> [1] 0.07109485
for the correlation between the balance variable and the deposit, it can be said that it has no correlation because the correlation value is close to 0, which is only 0.081
cor(data_clean$duration, data_clean$deposit)#> [1] 0.4382886
for the correlation between the duration variable and the deposit it can be said to have a positive correlation even though it is relatively small with a value of only 0.452.
boxplot(data_clean$duration,horizontal = T)from the boxplot it can be seen that there are durations above 1000 seconds
drtn_up1000 = data_clean[data_clean$duration >= 1000,]
sort(table(drtn_up1000$deposit), decreasing = T)#>
#> 1 0
#> 584 70
from the results of data filtering which has a call duration of more than 1000 seconds has a success ratio of attracting customers for subscribe to term deposits yaitu 1:8
Even though duration does not has a large correlation, by giving enough time the marketing team can provide detailed explanations to customers, that will make customers interested for subscribing to term deposits. Sufficient time will also create a two-way conversation and allow potential customers to ask questions about the product being promoted