1 Bank Marketing Dataset

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

1.1 Library

library(dplyr)

The package “dplyr” comprises many functions that perform mostly used data manipulation operations such as applying filter, selecting specific columns, sorting data, adding or deleting columns and aggregating data. Another most important advantage of this package is that it’s very easy to learn and use dplyr functions.

2 Data

data = read.csv("E:/Algoritma/1_lbb/bank.csv")
head(data)
#>   age        job marital education default balance housing loan contact day
#> 1  59     admin. married secondary      no    2343     yes   no unknown   5
#> 2  56     admin. married secondary      no      45      no   no unknown   5
#> 3  41 technician married secondary      no    1270     yes   no unknown   5
#> 4  55   services married secondary      no    2476     yes   no unknown   5
#> 5  54     admin. married  tertiary      no     184      no   no unknown   5
#> 6  42 management  single  tertiary      no       0     yes  yes unknown   5
#>   month duration campaign pdays previous poutcome deposit
#> 1   may     1042        1    -1        0  unknown     yes
#> 2   may     1467        1    -1        0  unknown     yes
#> 3   may     1389        1    -1        0  unknown     yes
#> 4   may      579        1    -1        0  unknown     yes
#> 5   may      673        2    -1        0  unknown     yes
#> 6   may      562        2    -1        0  unknown     yes

the table above is a sample of the top 6 data from Bank Marketing Dataset

2.1 Subsetting Data

data_clean <- data %>%
  select (-c(default, contact, day, month)) %>%
  mutate(
    housing = ifelse(housing=="yes",1,0),
    loan = ifelse(loan=="yes",1,0),
    deposit = ifelse(deposit=="yes",1,0)) %>%
  mutate (
    job = as.factor(job),
    marital = as.factor(marital),
    education = as.factor (education),
    housing = as.factor(housing),
    loan = as.factor(loan),
    poutcome = as.factor(poutcome),
    deposit = as.integer(deposit)
  )

str(data_clean)
#> 'data.frame':    11162 obs. of  13 variables:
#>  $ age      : int  59 56 41 55 54 42 56 60 37 28 ...
#>  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 1 1 10 8 1 5 5 6 10 8 ...
#>  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 2 1 2 3 ...
#>  $ education: Factor w/ 4 levels "primary","secondary",..: 2 2 2 2 3 3 3 2 2 2 ...
#>  $ balance  : int  2343 45 1270 2476 184 0 830 545 1 5090 ...
#>  $ housing  : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 2 ...
#>  $ loan     : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 1 1 1 ...
#>  $ duration : int  1042 1467 1389 579 673 562 1201 1030 608 1297 ...
#>  $ campaign : int  1 1 1 1 2 2 1 1 1 3 ...
#>  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#>  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
#>  $ deposit  : int  1 1 1 1 1 1 1 1 1 1 ...

Describe about dataset :

  • age : age of the customer
  • job : type of job
  • marital : marital status
  • education : education of the customer
  • housing : customer has housing loan (categorical: ‘no’,‘yes’)
  • loan : customer has personal loan (categorical: ‘no’,‘yes’)
  • duration: last contact duration, in seconds (numeric).
  • campaign: number of contacts performed during this campaign and for this client
  • pdays: number of days that passed by after the client was last contacted from a previous campaign
  • previous : number of contacts performed before this campaign and for this client (numeric)
  • poutcome: outcome of the previous marketing campaign
  • deposite: has the client subscribed a term deposit (binary: ‘yes’,‘no’)

2.2 Filtering Data

data_clean <- data_clean[data$balance >= 0,]
head(data_clean)
#>   age        job marital education balance housing loan duration campaign pdays
#> 1  59     admin. married secondary    2343       1    0     1042        1    -1
#> 2  56     admin. married secondary      45       0    0     1467        1    -1
#> 3  41 technician married secondary    1270       1    0     1389        1    -1
#> 4  55   services married secondary    2476       1    0      579        1    -1
#> 5  54     admin. married  tertiary     184       0    0      673        2    -1
#> 6  42 management  single  tertiary       0       1    1      562        2    -1
#>   previous poutcome deposit
#> 1        0  unknown       1
#> 2        0  unknown       1
#> 3        0  unknown       1
#> 4        0  unknown       1
#> 5        0  unknown       1
#> 6        0  unknown       1

filtering is used to get customers who have a balance of more than 0

2.3 Cleansing Data

colSums(is.na(data_clean))
#>       age       job   marital education   balance   housing      loan  duration 
#>         0         0         0         0         0         0         0         0 
#>  campaign     pdays  previous  poutcome   deposit 
#>         0         0         0         0         0

the dataset that we have does not have a missing value so that further analysis can be carried out and no missing value handling is required

2.4 Dimention of Dataset

dim(data_clean)
#> [1] 10474    13

the data used is 10,474 rows with 13 variables

3 Descriptive statistics

summary(data_clean)
#>       age                 job           marital         education   
#>  Min.   :18.00   management :2457   divorced:1177   primary  :1373  
#>  1st Qu.:32.00   blue-collar:1733   married :5966   secondary:5081  
#>  Median :39.00   technician :1714   single  :3331   tertiary :3539  
#>  Mean   :41.34   admin.     :1245                   unknown  : 481  
#>  3rd Qu.:49.00   services   : 844                                   
#>  Max.   :95.00   retired    : 755                                   
#>                  (Other)    :1726                                   
#>     balance        housing  loan        duration         campaign     
#>  Min.   :    0.0   0:5702   0:9233   Min.   :   2.0   Min.   : 1.000  
#>  1st Qu.:  183.2   1:4772   1:1241   1st Qu.: 139.0   1st Qu.: 1.000  
#>  Median :  625.0                     Median : 256.0   Median : 2.000  
#>  Mean   : 1650.2                     Mean   : 371.1   Mean   : 2.493  
#>  3rd Qu.: 1840.8                     3rd Qu.: 492.8   3rd Qu.: 3.000  
#>  Max.   :81204.0                     Max.   :3881.0   Max.   :63.000  
#>                                                                       
#>      pdays           previous          poutcome       deposit      
#>  Min.   : -1.00   Min.   : 0.0000   failure:1169   Min.   :0.0000  
#>  1st Qu.: -1.00   1st Qu.: 0.0000   other  : 519   1st Qu.:0.0000  
#>  Median : -1.00   Median : 0.0000   success:1062   Median :0.0000  
#>  Mean   : 52.42   Mean   : 0.8572   unknown:7724   Mean   :0.4849  
#>  3rd Qu.: 71.75   3rd Qu.: 1.0000                  3rd Qu.:1.0000  
#>  Max.   :854.00   Max.   :58.0000                  Max.   :1.0000  
#> 

The output above is a breakdown of the size of the data distribution

sort(table(data_clean$education), decreasing = T)
#> 
#> secondary  tertiary   primary   unknown 
#>      5081      3539      1373       481

It can be seen that customers with secondary education are at most 5,081 people, and followed by tertiaries of 3,539 people with 481 customers whose education is unknown

sort(table(data_clean$job), decreasing = T)
#> 
#>    management   blue-collar    technician        admin.      services 
#>          2457          1733          1714          1245           844 
#>       retired self-employed       student    unemployed  entrepreneur 
#>           755           387           360           343           301 
#>     housemaid       unknown 
#>           265            70

then 2,457 customers work in management, and followed by blue-collar of 1,733 customers. In third place there is a technician of 1,714 customers with 70 customers unknown job

sort(table(data_clean$deposit), decreasing = T)
#> 
#>    0    1 
#> 5395 5079

bank noted that there were 5,079 customers who subscribed to deposits after carrying out promotions by telephone for customers who have a balance of more than 0

sd(data_clean$balance)
#> [1] 3291.662

if a customer has a balance above 1,650.2 + 3,291.662 it can be said that the customer belongs to the upper middle class economy. conversely, if a customer has a balance below 1,650.2 - 3,291.662 then the customer is classified as a lower middle class economy

3.1 Correlation with Deposit

cor(data_clean$balance, data_clean$deposit)
#> [1] 0.07109485

for the correlation between the balance variable and the deposit, it can be said that it has no correlation because the correlation value is close to 0, which is only 0.081

cor(data_clean$duration, data_clean$deposit)
#> [1] 0.4382886

for the correlation between the duration variable and the deposit it can be said to have a positive correlation even though it is relatively small with a value of only 0.452.

4 Plot

boxplot(data_clean$duration,horizontal = T)

from the boxplot it can be seen that there are durations above 1000 seconds

drtn_up1000 = data_clean[data_clean$duration >= 1000,]
sort(table(drtn_up1000$deposit), decreasing = T)
#> 
#>   1   0 
#> 584  70

from the results of data filtering which has a call duration of more than 1000 seconds has a success ratio of attracting customers for subscribe to term deposits yaitu 1:8

4.1 Conclusion

Even though duration does not has a large correlation, by giving enough time the marketing team can provide detailed explanations to customers, that will make customers interested for subscribing to term deposits. Sufficient time will also create a two-way conversation and allow potential customers to ask questions about the product being promoted