1 Background

Telemarketing is one of the ways banks advertise their products, but it is intrusive and annoying to customers who receive unwanted calls. I have never been a telemarketer, but I have received many telemarketing calls from unknown numbers. As soon as I know that the caller is a telemarketer, I end the call after a brief conversation, whether or not the caller has described their offer. I think that telemarketers (of credit without collateral) often target people who can pay for their services but have no need for them. Conversely, I think that people who need the services usually cannot pay.

A study conducted with data from a bank in Portugal aimed to predict which customers would accept the term deposit product offered by that bank’s telemarketers. This data is from here. Using this data, my end goal is to predict which customers will accept or reject the telemarketer’s offer and to compare the predictions with the true results. Before I can do that, I need to familiarize myself with the data and look for patterns in it.

2 Data description

There are four datasets in the link above:

  • bank.csv
  • bank-full.csv
  • bank-additional.csv
  • bank-additional-full.csv

The description of each column can be found at the same link.

The columns are described as follows:

  1. age (numeric)
  2. job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’ ,‘technician’,‘unemployed’,‘unknown’)
  3. marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
  4. education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
  5. default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
  6. housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
  7. loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
  8. contact: contact communication type (categorical: ‘cellular’,‘telephone’)
  9. month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
  10. day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
  11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. previous: number of contacts performed before this campaign and for this client (numeric)
  15. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
  16. y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)
  17. balance: not in the original description, but interpreted here as each customer’s bank balance at the time of the current campaign

The four datasets differ in their number of observations. For the purpose of exploratory data analysis I am going to use bank.csv (4,521 observations) and bank-full.csv (45,211 observations). Note that these two files do not match the description above exactly, as the str() output below shows: they contain day rather than day_of_week, they include balance, and pdays uses -1 rather than 999 for clients who were not previously contacted.

3 Data preprocessing

# read in the bank.csv and bank-full.csv
bank <- read.csv("data_input/bank.csv",sep = ";")
bank_full <- read.csv("data_input/bank-full.csv", sep=";")
# inspect the first six rows of bank and bank_full 
head(bank)
head(bank_full)
# inspect the data types of bank and bank_full
str(bank)
#> 'data.frame':    4521 obs. of  17 variables:
#>  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
#>  $ job      : chr  "unemployed" "services" "management" "management" ...
#>  $ marital  : chr  "married" "married" "single" "married" ...
#>  $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
#>  $ default  : chr  "no" "no" "no" "no" ...
#>  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
#>  $ housing  : chr  "no" "yes" "yes" "yes" ...
#>  $ loan     : chr  "no" "yes" "no" "yes" ...
#>  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
#>  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
#>  $ month    : chr  "oct" "may" "apr" "jun" ...
#>  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
#>  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
#>  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
#>  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
#>  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
#>  $ y        : chr  "no" "no" "no" "no" ...
str(bank_full)
#> 'data.frame':    45211 obs. of  17 variables:
#>  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
#>  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
#>  $ marital  : chr  "married" "single" "married" "married" ...
#>  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
#>  $ default  : chr  "no" "no" "no" "no" ...
#>  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
#>  $ housing  : chr  "yes" "yes" "yes" "yes" ...
#>  $ loan     : chr  "no" "no" "yes" "no" ...
#>  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
#>  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
#>  $ month    : chr  "may" "may" "may" "may" ...
#>  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
#>  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#>  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
#>  $ y        : chr  "no" "no" "no" "no" ...
# check whether bank and bank_full contain any missing (NA) values
anyNA(bank)
#> [1] FALSE
anyNA(bank_full)
#> [1] FALSE
# convert the character columns of bank to factors
# (factor_cols avoids shadowing the base function names())
factor_cols <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y")
bank[factor_cols] <- lapply(bank[factor_cols], as.factor)
str(bank)
#> 'data.frame':    4521 obs. of  17 variables:
#>  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
#>  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
#>  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
#>  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
#>  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
#>  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
#>  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
#>  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
#>  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
#>  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
#>  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
#>  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
#>  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
#>  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
#>  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
#>  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# read the descriptive summary of each dataset
summary(bank)
#>       age                 job          marital         education    default   
#>  Min.   :19.00   management :969   divorced: 528   primary  : 678   no :4445  
#>  1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306   yes:  76  
#>  Median :39.00   technician :768   single  :1196   tertiary :1350             
#>  Mean   :41.17   admin.     :478                   unknown  : 187             
#>  3rd Qu.:49.00   services   :417                                              
#>  Max.   :87.00   retired    :230                                              
#>                  (Other)    :713                                              
#>     balance      housing     loan           contact          day       
#>  Min.   :-3313   no :1962   no :3830   cellular :2896   Min.   : 1.00  
#>  1st Qu.:   69   yes:2559   yes: 691   telephone: 301   1st Qu.: 9.00  
#>  Median :  444                         unknown  :1324   Median :16.00  
#>  Mean   : 1423                                          Mean   :15.92  
#>  3rd Qu.: 1480                                          3rd Qu.:21.00  
#>  Max.   :71188                                          Max.   :31.00  
#>                                                                        
#>      month         duration       campaign          pdays       
#>  may    :1398   Min.   :   4   Min.   : 1.000   Min.   : -1.00  
#>  jul    : 706   1st Qu.: 104   1st Qu.: 1.000   1st Qu.: -1.00  
#>  aug    : 633   Median : 185   Median : 2.000   Median : -1.00  
#>  jun    : 531   Mean   : 264   Mean   : 2.794   Mean   : 39.77  
#>  nov    : 389   3rd Qu.: 329   3rd Qu.: 3.000   3rd Qu.: -1.00  
#>  apr    : 293   Max.   :3025   Max.   :50.000   Max.   :871.00  
#>  (Other): 571                                                   
#>     previous          poutcome      y       
#>  Min.   : 0.0000   failure: 490   no :4000  
#>  1st Qu.: 0.0000   other  : 197   yes: 521  
#>  Median : 0.0000   success: 129             
#>  Mean   : 0.5426   unknown:3705             
#>  3rd Qu.: 0.0000                            
#>  Max.   :25.0000                            
#> 
summary(bank_full)
#>       age            job              marital           education        
#>  Min.   :18.00   Length:45211       Length:45211       Length:45211      
#>  1st Qu.:33.00   Class :character   Class :character   Class :character  
#>  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
#>  Mean   :40.94                                                           
#>  3rd Qu.:48.00                                                           
#>  Max.   :95.00                                                           
#>    default             balance         housing              loan          
#>  Length:45211       Min.   : -8019   Length:45211       Length:45211      
#>  Class :character   1st Qu.:    72   Class :character   Class :character  
#>  Mode  :character   Median :   448   Mode  :character   Mode  :character  
#>                     Mean   :  1362                                        
#>                     3rd Qu.:  1428                                        
#>                     Max.   :102127                                        
#>    contact               day           month              duration     
#>  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
#>  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
#>  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
#>                     Mean   :15.81                      Mean   : 258.2  
#>                     3rd Qu.:21.00                      3rd Qu.: 319.0  
#>                     Max.   :31.00                      Max.   :4918.0  
#>     campaign          pdays          previous          poutcome        
#>  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
#>  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
#>  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
#>  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
#>  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
#>  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
#>       y            
#>  Length:45211      
#>  Class :character  
#>  Mode  :character  
#>                    
#>                    
#> 

Comparing the two summaries, the numeric columns change very little between bank, which holds about 10% of the observations, and bank_full, which holds all of them. The categorical columns of bank_full were not converted to factors, so their summaries show only the column length and cannot be compared. I therefore treat bank as representative of bank_full and do not repeat the EDA for bank_full. From this point onward, I will work only on the bank data.
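The comparison above was made by eye from the two summary() printouts. A minimal sketch of how the same check could be done numerically follows. It uses made-up toy data, and the names full_toy, sample_toy and numeric_summary are hypothetical; with the real data one would pass bank and bank_full instead.

```r
# Toy stand-ins for bank_full (all rows) and bank (a 10% sample); values are made up.
set.seed(1)
full_toy <- data.frame(
  age     = sample(18:95, 1000, replace = TRUE),
  balance = round(rnorm(1000, mean = 1400, sd = 3000))
)
sample_toy <- full_toy[sample(nrow(full_toy), 100), ]

# Quartiles and mean of every numeric column in a data frame
numeric_summary <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  sapply(df[num_cols], function(x) c(quantile(x, c(0.25, 0.5, 0.75)), mean = mean(x)))
}

# Difference between the sample and the full-data summaries, column by column
summary_diff <- numeric_summary(sample_toy) - numeric_summary(full_toy)
print(round(summary_diff, 2))
```

Small differences relative to the spread of each column would support treating the sample as representative, at least for the numeric columns.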

Next, using boxplots, I want to see whether the numeric columns contain outliers and what their interquartile ranges (IQR) look like.

4 Initial EDA to understand the data

boxplot(bank$age, main = "age")

boxplot(bank$balance, main = "balance")

boxplot(bank$day, main = "day of last contact")

boxplot(bank$duration, main = "duration of contact in seconds")

boxplot(bank$campaign, main = "number of times a client has been contacted for this campaign")

# in bank.csv, pdays is -1 (not 999) for clients who were never contacted before
boxplot(bank$pdays, main = "days since a client was last contacted for a previous campaign", sub = "-1 means never contacted")

boxplot(bank$previous, main = "number of times a client has been contacted for previous campaigns")

From these boxplots I can see that:

  • All of the numeric columns except day have outliers.
  • The interquartile range of the numeric columns is very narrow, except for day and age.

# This chunk demonstrates how to draw boxplots of all numeric columns in one plot after scaling them with z-scores.

# bank2 <- bank
# ind <- sapply(bank2, is.numeric)
# bank2[ind] <- lapply(bank2[ind], scale)
# boxplot(x=bank2[,ind])
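The two observations read off the boxplots above can also be checked numerically. The following is a minimal sketch on made-up toy data; outlier_report is a hypothetical helper name, and with the real data one would call it on bank. It relies on boxplot.stats(), which applies the same 1.5 * IQR whisker rule that boxplot() uses by default.

```r
# Toy data frame; the column names mirror the bank data but the values are made up.
toy <- data.frame(
  age      = c(25, 30, 35, 40, 45, 50, 55, 90),
  day      = c(1, 5, 9, 13, 17, 21, 25, 29),
  campaign = c(1, 1, 1, 2, 2, 3, 3, 40)
)

# For each numeric column: number of points beyond the whiskers, and the IQR.
# Note: IQR() is quantile-based, while the whiskers are based on hinges,
# so the two can differ slightly on small samples.
outlier_report <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  data.frame(
    n_outliers = vapply(df[num_cols], function(x) length(boxplot.stats(x)$out), integer(1)),
    iqr        = vapply(df[num_cols], IQR, numeric(1))
  )
}

report <- outlier_report(toy)
print(report)
```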

What about the number of customers in each level of each factor column?

plot(bank$job, main="job", las=2)

Three job categories account for a large share of the bank’s customers: management, blue-collar, and technician. Management is the largest (969 customers), followed by blue-collar (946) and technician (768).
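The bar plots show counts per level. To read off exact counts and shares instead of estimating them from the bars, table() and prop.table() can be used. This is a minimal sketch on a made-up factor; job_toy is a hypothetical stand-in for bank$job.

```r
# Toy factor standing in for bank$job; the values are made up.
job_toy <- factor(c("management", "blue-collar", "management", "technician",
                    "blue-collar", "management", "services", "technician"))

# Counts per level, largest first
job_counts <- sort(table(job_toy), decreasing = TRUE)

# Share of each level as a percentage of all rows
job_share <- round(100 * prop.table(job_counts), 1)

print(job_counts)
print(job_share)
```

The same two lines work for any of the factor columns plotted in this section.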

plot(bank$education, main="education")

Customers with secondary education (high school) form the largest group.

plot(bank$default, main = "customers who defaulted on credit")

Most customers in the dataset have no credit in default (4,445 of 4,521).

plot(bank$loan, main = "having a loan besides housing")

plot(bank$contact, main = "contact method")

Cellular phone was the most used contact method in the telemarketing campaign.

plot(bank$marital, main = "marital status", sub = "divorced includes widowed status")

Most of the customers are married; divorced or widowed customers form the smallest group.

plot(bank$housing, main = "customers with housing loan")

What is interesting is that the difference between the number of customers with a housing loan and the number without one is relatively small: about 600 people (2,559 versus 1,962). Still, more customers have a housing loan than do not.

plot(bank$month, main = "month of last contact")

The highest number of customers were last contacted in May. I assume this is for the last campaign.

plot(bank$poutcome, main =  "outcome of previous campaign")

In the previous campaign, customers who subscribed to the term deposit (success) form the smallest group, while the largest group is unknown.

plot(bank$y, main = "yes or no to the term deposit")

The number of customers who rejected the term deposit in the current campaign is far larger than the number who accepted: only 521 of 4,521 customers (about 11.5%) accepted the offer.

# check the distribution of bank balances of the customers
hist(x = bank$balance, breaks = 8, main = "Distribution of Bank Balance", xlab = "bank balance", ylab = "number of people")

The histogram shows that the great majority of customers have a balance of at most 10,000 euros, which is consistent with the third quartile of 1,480 euros in the summary. Comparatively few customers have a negative balance, and the number of customers with a balance above 20,000 euros is very small.
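Counts are hard to read off a histogram with so few bins. A minimal sketch of how to obtain the exact bin counts, and the specific counts mentioned above, follows. The data is made up, and balance_toy is a hypothetical stand-in for bank$balance.

```r
# Toy balances in euros; the values are made up for illustration.
balance_toy <- c(-500, -20, 0, 150, 800, 1200, 3000, 9500, 15000, 42000)

# Compute the histogram without drawing it, to read the exact bin counts.
# breaks = 8 is only a suggestion; hist() picks "pretty" break points near it.
h <- hist(balance_toy, breaks = 8, plot = FALSE)
bin_counts <- data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  count = h$counts
)
print(bin_counts)

# Direct counts for the statements made about the plot
n_negative  <- sum(balance_toy < 0)
n_upto_10k  <- sum(balance_toy >= 0 & balance_toy <= 10000)
n_above_20k <- sum(balance_toy > 20000)
```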

5 Conclusion

What I can conclude from this initial EDA is that:

  1. The data is already clean with regard to missing values.
  2. All character columns have been converted to factors.
  3. Most of the numeric columns have outliers and a narrow IQR.
  4. Most customers have a bank balance below 1,500 euros, with outliers reaching just over 70,000 euros (the maximum is 71,188).
  5. Most customers decline the telemarketing offer.
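One caveat to the first conclusion: anyNA() only detects NA values, while the summary above shows that some columns use an "unknown" label instead (for example 1,324 rows in contact and 3,705 in poutcome). A minimal sketch of how such labels could be counted follows, using made-up data; toy and unknown_counts are hypothetical names, and with the real data one would apply the same call to bank.

```r
# Toy data frame; column names mirror the bank data but the values are made up.
toy <- data.frame(
  contact  = c("cellular", "unknown", "telephone", "unknown"),
  poutcome = c("unknown", "failure", "unknown", "unknown"),
  age      = c(30, 41, 52, 38)
)

# "unknown" is an ordinary string, so no NA is reported
has_na <- anyNA(toy)

# Number of "unknown" entries per column (numeric columns simply yield 0)
unknown_counts <- vapply(toy, function(x) sum(x == "unknown"), integer(1))

print(has_na)
print(unknown_counts)
```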