1. Background

This analysis looks at the results of a direct marketing campaign that ran from May 2008 to November 2010. The data come from direct marketing campaigns (phone calls) of a Portuguese banking institution.

2. Data preparation

Read data

All non-numeric columns are read as factors, because most of the variables are categorical.

library(dplyr)
# Semicolon-separated file; read text columns as factors
bank <- read.csv("bank.csv", sep = ";", stringsAsFactors = TRUE)
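To make the effect of sep and stringsAsFactors concrete, here is a minimal sketch on a few made-up rows. The inline text and the name toy are illustrative assumptions, not part of bank.csv.

```r
# Hypothetical stand-in for bank.csv, using the same ";" separator
csv_text <- "age;job;balance;y
58;management;2143;no
44;technician;29;no
33;entrepreneur;2;yes"

toy <- read.csv(text = csv_text, sep = ";", stringsAsFactors = TRUE)

# Text columns become factors; whole-number columns stay integer
sapply(toy, class)
```

Checking the column classes right after reading is a cheap way to confirm that the separator was interpreted as intended.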

The variables are:

  • age : client’s age (numeric)

  • job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)

  • marital : marital status (categorical: ‘divorced’,‘married’,‘single’; note: ‘divorced’ means divorced or widowed)

  • education : education level (categorical: ‘primary’,‘secondary’,‘tertiary’,‘unknown’)

  • default : has credit in default? (binary: ‘no’,‘yes’)

  • balance : average yearly balance, in euros (numeric)

  • housing : has housing loan? (binary: ‘no’,‘yes’)

  • loan : has personal loan? (binary: ‘no’,‘yes’)

  • contact : contact communication type (categorical: ‘cellular’,‘telephone’,‘unknown’)

  • day : last contact day of the month (numeric)

  • month : last contact month of the year (categorical)

  • duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

  • campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

  • pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)

  • previous: number of contacts performed before this campaign and for this client (numeric)

  • poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘other’,‘success’,‘unknown’)

  • y : has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

Data inspection

glimpse(bank)
#> Rows: 45,211
#> Columns: 17
#> $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57,…
#> $ job       <fct> management, technician, entrepreneur, blue-collar, unknown,…
#> $ marital   <fct> married, single, married, married, single, married, single,…
#> $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary,…
#> $ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no…
#> $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 7…
#> $ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, …
#> $ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, n…
#> $ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unkno…
#> $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
#> $ month     <fct> may, may, may, may, may, may, may, may, may, may, may, may,…
#> $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517…
#> $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
#> $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unkno…
#> $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,…
levels(bank$previous)
#> NULL

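Columns to be removed:

  • previous (dropped from this analysis; note that levels() returns NULL for any integer column, so the NULL above reflects the column type rather than an absence of information)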
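The NULL returned by levels(bank$previous) is worth a closer look, since levels() only applies to factors. The sketch below uses made-up values (the vector previous here is hypothetical) to show other ways an integer column could be inspected.

```r
# Hypothetical values standing in for bank$previous
previous <- c(0L, 0L, 1L, 3L, 0L)

levels(previous)      # NULL: integer vectors have no levels
is.factor(previous)   # FALSE

# Ways to inspect a numeric column instead
table(previous)       # count of each distinct value
summary(previous)     # min, quartiles, mean, max
```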

Columns to be converted:

  • day : integer to factor

Check missing data

anyNA(bank)
#> [1] FALSE
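anyNA() only detects NA values. The glimpse output above also shows placeholder values such as unknown in factor columns and -1 in pdays. Whether these should be treated as missing is a judgment call. The sketch below, on made-up rows (toy is a hypothetical data frame), shows one way they could be counted.

```r
# Hypothetical rows mimicking placeholder values seen in glimpse(bank)
toy <- data.frame(
  contact  = factor(c("unknown", "cellular", "unknown", "telephone")),
  poutcome = factor(c("unknown", "unknown", "failure", "success")),
  pdays    = c(-1L, -1L, 120L, 90L)
)

anyNA(toy)  # FALSE, although several values carry no information

# Count "unknown" entries in every factor column
is_fct <- sapply(toy, is.factor)
unknown_counts <- sapply(toy[is_fct], function(col) sum(col == "unknown"))
unknown_counts

# Count the -1 placeholder in pdays
sum(toy$pdays == -1)
```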

Data cleansing

bank <- bank %>% 
  select(-previous) %>% 
  mutate(day = as.factor(day))
head(bank)
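Converting to a factor also fixes the order of the levels, which later controls how tables and plots are sorted. The sketch below uses made-up values (day, month and month_cal are illustrative names). It shows that integer input gets numerically sorted levels, while a factor read from text is alphabetical and can be reordered to calendar order if desired.

```r
# Hypothetical values standing in for bank$day and bank$month
day   <- c(5L, 20L, 18L, 5L)
month <- factor(c("may", "jul", "apr", "may"))

# as.factor() on integers sorts the levels numerically
levels(as.factor(day))

# Factors created from text get alphabetical levels
levels(month)

# Reorder to calendar order; months not present are kept as empty levels
month_cal <- factor(month, levels = tolower(month.abb))
table(month_cal)
```

The calendar reordering is optional and is not done in the cleansing step above.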

3. Exploratory data analysis

summary(bank)
#>       age                 job           marital          education    
#>  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
#>  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
#>  Median :39.00   technician :7597   single  :12790   tertiary :13301  
#>  Mean   :40.94   admin.     :5171                    unknown  : 1857  
#>  3rd Qu.:48.00   services   :4154                                     
#>  Max.   :95.00   retired    :2264                                     
#>                  (Other)    :6835                                     
#>  default        balance       housing      loan            contact     
#>  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
#>  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
#>              Median :   448                           unknown  :13020  
#>              Mean   :  1362                                            
#>              3rd Qu.:  1428                                            
#>              Max.   :102127                                            
#>                                                                        
#>       day            month          duration         campaign     
#>  20     : 2752   may    :13766   Min.   :   0.0   Min.   : 1.000  
#>  18     : 2308   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
#>  21     : 2026   aug    : 6247   Median : 180.0   Median : 2.000  
#>  17     : 1939   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
#>  6      : 1932   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
#>  5      : 1910   apr    : 2932   Max.   :4918.0   Max.   :63.000  
#>  (Other):32344   (Other): 6060                                    
#>      pdays          poutcome       y        
#>  Min.   : -1.0   failure: 4901   no :39922  
#>  1st Qu.: -1.0   other  : 1840   yes: 5289  
#>  Median : -1.0   success: 1511              
#>  Mean   : 40.2   unknown:36959              
#>  3rd Qu.: -1.0                              
#>  Max.   :871.0                              
#> 

Summary:

  1. The average client age is around 40 (mean 40.94, median 39).

  2. The most common job category is blue-collar (9,732 clients).

  3. The longest contact lasted 4,918 seconds; the average contact duration is about 258 seconds.

  4. Cellular is the most common contact type, used for 29,285 contacts during the campaign period.
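The summary output also gives the counts for the target y, which can be turned into shares. A minimal sketch follows, with the two counts copied from summary(bank); the names y_counts and y_share are illustrative.

```r
# Counts copied from the y column of summary(bank)
y_counts <- c(no = 39922, yes = 5289)

# Share of each outcome among all contacts
y_share <- prop.table(y_counts)
round(y_share, 3)
```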

4. Business questions

  1. If the average client age is around 40, how do the job categories break down for clients aged 40? Is blue-collar still the most common, or does another insight emerge?
# Keep only clients aged exactly 40, then count them by job
bank_40 <- bank %>% 
  filter(age == 40)
sort(table(bank_40$job), decreasing = TRUE)
#> 
#>   blue-collar    management    technician        admin.      services 
#>           338           268           223           172           148 
#> self-employed    unemployed  entrepreneur     housemaid       retired 
#>            54            48            47            36            15 
#>       student       unknown 
#>             4             2

Yes, the top three job categories among 40-year-old clients are the same as in the full data. The ranking is as follows:

  • blue-collar

  • management

  • technician
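Raw counts are hard to compare between the 40-year-old group and the full data, so it may help to express them as percentages. The sketch below copies the counts from the two outputs above; job_40, share_40 and job_all are illustrative names.

```r
# Job counts for clients aged exactly 40, copied from the table above
job_40 <- c("blue-collar" = 338, "management" = 268, "technician" = 223,
            "admin." = 172, "services" = 148, "self-employed" = 54,
            "unemployed" = 48, "entrepreneur" = 47, "housemaid" = 36,
            "retired" = 15, "student" = 4, "unknown" = 2)

# Percentage of 40-year-old clients in each job
share_40 <- round(100 * job_40 / sum(job_40), 1)
head(share_40, 3)

# Overall counts for the same three jobs, copied from summary(bank)
job_all <- c("blue-collar" = 9732, "management" = 9458, "technician" = 7597)
round(100 * job_all / 45211, 1)
```

Comparing the two sets of percentages shows whether the ranking is merely the same or whether the mix of jobs actually shifts at age 40.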

  2. We know the longest call lasted 4,918 seconds. Which education level is most common among clients whose call lasted longer than the average duration of 258.2 seconds?
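# Keep calls lasting at least 258.2 seconds (the average duration)
bank_edu <- bank %>% filter(duration >= 258.2)
# Sum the `campaign` contact counts within each education level
bank_edu <- xtabs(campaign ~ education, bank_edu)
bank_edu <- as.data.frame(bank_edu)
# Sort from highest to lowest total
bank_edu <- bank_edu %>% 
  arrange(-Freq)
bank_edu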

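Now we know that the secondary education level has the highest total among calls longer than the average duration of 258.2 seconds. Note that xtabs(campaign ~ education, ...) sums the campaign contact counts within each level rather than counting the calls themselves.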
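The formula passed to xtabs() decides what the Freq column means. The sketch below, on made-up rows (calls is a hypothetical data frame), contrasts a formula with campaign on the left-hand side against a one-sided formula that simply counts rows.

```r
# Hypothetical calls, all assumed to be longer than the average duration
calls <- data.frame(
  education = factor(c("primary", "secondary", "secondary", "tertiary")),
  campaign  = c(10L, 1L, 2L, 3L)
)

# Left-hand side given: sums `campaign` within each education level
sum_by_edu <- xtabs(campaign ~ education, data = calls)
sum_by_edu

# No left-hand side: counts rows (calls) per education level
n_by_edu <- xtabs(~ education, data = calls)
n_by_edu
```

In this toy data the two tables rank the levels differently, so the choice of formula should match the question being asked.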

5. Data visualization

Data outliers

We want to see the outliers in call duration for each job category.

library(ggplot2)
ggplot(data = bank, mapping = aes(x = job, y = duration)) +
  # Hide the boxplot's own outlier markers; geom_point() draws every call
  geom_boxplot(outlier.shape = NA) +
  geom_point()
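The outliers can also be listed numerically rather than read off the plot. The sketch below uses base R's boxplot.stats(), which applies a 1.5 * IQR rule similar to the one behind boxplot whiskers. The data are made up, and the exact quartile definition may differ slightly from the one ggplot2 uses.

```r
# Hypothetical call durations (seconds) for two job categories
calls <- data.frame(
  job      = c(rep("student", 6), rep("retired", 6)),
  duration = c(100, 120, 130, 140, 150, 900,
               200, 210, 220, 230, 240, 250)
)

# For each job, return the durations flagged as outliers
outliers <- tapply(calls$duration, calls$job,
                   function(d) boxplot.stats(d)$out)
outliers
```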

Which job categories have the longest average call duration?

bank_job_vis <- bank %>% 
  group_by(job) %>% 
  summarise(duration = mean(duration)) %>% 
  ungroup() %>% 
  arrange(-duration)

bank_job_vis %>% 
  arrange(duration) %>% 
  mutate(job = factor(job, levels = job)) %>% 
  ggplot(mapping = aes(x=job, y=duration)) +
    geom_segment( aes(xend=job, yend=0)) +
    geom_point( size=4, color="orange") +
    coord_flip() +
    theme_bw() +
    xlab("Job Type")+
    ylab("Duration (seconds)")

Unemployed clients have the longest average call duration, along with retired and self-employed clients. Clients in other job categories can probably take calls too, but have less time available.
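Because call durations are strongly skewed (the maximum is 4,918 seconds against a median of 180), a ranking by mean can be driven by a few very long calls. The sketch below, on made-up data, places the median next to the mean for each job as a possible cross-check. It is an illustration, not a result from the bank data.

```r
# Hypothetical call durations (seconds); one very long call for "unemployed"
calls <- data.frame(
  job      = c("unemployed", "unemployed", "unemployed",
               "services", "services", "services"),
  duration = c(100, 120, 1500, 200, 220, 240)
)

# Mean and median duration per job, side by side
mean_by_job   <- tapply(calls$duration, calls$job, mean)
median_by_job <- tapply(calls$duration, calls$job, median)
cbind(mean = mean_by_job, median = median_by_job)
```

In this toy data the two statistics rank the jobs in opposite order, which is the kind of disagreement worth checking for before drawing conclusions.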