Telemarketing is one of the ways banks can advertise their products, but it is intrusive and annoying to the customers who receive the unwanted calls. I have never been a telemarketer, but I have received many telemarketing calls from unknown numbers. As soon as I know that the person calling is a telemarketer, I end the call after a brief conversation, whether or not the person on the other end has talked about their offer. I think that telemarketers offering credit without collateral often target people who can pay for their services but have no need for them. Conversely, I think that the people who need those services usually cannot pay.

A study using data from a bank in Portugal aimed to predict which customers would accept the term deposit product offered by that bank’s telemarketers. This data is from here. Using this data, my end goal is to predict which customers will accept or reject the telemarketer’s offer and to compare the predictions with the true results. Before I can do that, I need to familiarize myself with the data and look for patterns in it.
There are four datasets at the link above:
The description of each column can be found at the same link.
The descriptions of the columns are:
The files come in two pairs. bank.csv and bank-full.csv share the same 17 columns, and bank.csv is a sample of about 10% of bank-full.csv (4,521 versus 45,211 observations). The two bank-additional files are a later version of the data with extra columns. For the purpose of exploratory data analysis I am going to use bank.csv and bank-full.csv.
# read in the bank.csv and bank-full.csv
bank <- read.csv("data_input/bank.csv",sep = ";")
bank_full <- read.csv("data_input/bank-full.csv", sep=";")
# inspect the first six rows of bank and bank_full
head(bank)
head(bank_full)
# inspect the data types of bank and bank_full
str(bank)
#> 'data.frame': 4521 obs. of 17 variables:
#> $ age : int 30 33 35 30 59 35 36 39 41 43 ...
#> $ job : chr "unemployed" "services" "management" "management" ...
#> $ marital : chr "married" "married" "single" "married" ...
#> $ education: chr "primary" "secondary" "tertiary" "tertiary" ...
#> $ default : chr "no" "no" "no" "no" ...
#> $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
#> $ housing : chr "no" "yes" "yes" "yes" ...
#> $ loan : chr "no" "yes" "no" "yes" ...
#> $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
#> $ day : int 19 11 16 3 5 23 14 6 14 17 ...
#> $ month : chr "oct" "may" "apr" "jun" ...
#> $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
#> $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
#> $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
#> $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
#> $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
#> $ y : chr "no" "no" "no" "no" ...
str(bank_full)
#> 'data.frame': 45211 obs. of 17 variables:
#> $ age : int 58 44 33 47 33 35 28 42 58 43 ...
#> $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
#> $ marital : chr "married" "single" "married" "married" ...
#> $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
#> $ default : chr "no" "no" "no" "no" ...
#> $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
#> $ housing : chr "yes" "yes" "yes" "yes" ...
#> $ loan : chr "no" "no" "yes" "no" ...
#> $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
#> $ day : int 5 5 5 5 5 5 5 5 5 5 ...
#> $ month : chr "may" "may" "may" "may" ...
#> $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
#> $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#> $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
#> $ y : chr "no" "no" "no" "no" ...
# check whether bank or bank_full contain any missing (NA) values
anyNA(bank)
#> [1] FALSE
anyNA(bank_full)
#> [1] FALSE
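A FALSE from anyNA() only shows that nothing is coded as NA. The str() output above shows that several character columns use the string "unknown" instead, so it may be worth counting those as well. A minimal sketch on a small made-up data frame (the toy values and the helper name count_unknown are illustrative assumptions, not part of the original analysis):

```r
# Toy stand-in for a few columns of bank; the values are made up
toy <- data.frame(
  age       = c(30, 33, 35, 30),
  education = c("primary", "unknown", "tertiary", "tertiary"),
  contact   = c("cellular", "unknown", "unknown", "cellular"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count the entries equal to "unknown" in every column
count_unknown <- function(df) {
  vapply(df, function(col) sum(col == "unknown", na.rm = TRUE), integer(1))
}

unknown_counts <- count_unknown(toy)
print(unknown_counts)
```

The same helper should work on the real data as count_unknown(bank), before or after the conversion to factors.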
# convert the character columns of bank to factors
# (the name factor_cols avoids masking the base function names())
factor_cols <- c("job","marital","education","default","housing","loan","contact","month","poutcome","y")
bank[factor_cols] <- lapply(bank[factor_cols], as.factor)
str(bank)
#> 'data.frame': 4521 obs. of 17 variables:
#> $ age : int 30 33 35 30 59 35 36 39 41 43 ...
#> $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
#> $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
#> $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
#> $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
#> $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
#> $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
#> $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
#> $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
#> $ day : int 19 11 16 3 5 23 14 6 14 17 ...
#> $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
#> $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
#> $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
#> $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
#> $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
#> $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
#> $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# read the descriptive summary of each dataset
summary(bank)
#> age job marital education default
#> Min. :19.00 management :969 divorced: 528 primary : 678 no :4445
#> 1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306 yes: 76
#> Median :39.00 technician :768 single :1196 tertiary :1350
#> Mean :41.17 admin. :478 unknown : 187
#> 3rd Qu.:49.00 services :417
#> Max. :87.00 retired :230
#> (Other) :713
#> balance housing loan contact day
#> Min. :-3313 no :1962 no :3830 cellular :2896 Min. : 1.00
#> 1st Qu.: 69 yes:2559 yes: 691 telephone: 301 1st Qu.: 9.00
#> Median : 444 unknown :1324 Median :16.00
#> Mean : 1423 Mean :15.92
#> 3rd Qu.: 1480 3rd Qu.:21.00
#> Max. :71188 Max. :31.00
#>
#> month duration campaign pdays
#> may :1398 Min. : 4 Min. : 1.000 Min. : -1.00
#> jul : 706 1st Qu.: 104 1st Qu.: 1.000 1st Qu.: -1.00
#> aug : 633 Median : 185 Median : 2.000 Median : -1.00
#> jun : 531 Mean : 264 Mean : 2.794 Mean : 39.77
#> nov : 389 3rd Qu.: 329 3rd Qu.: 3.000 3rd Qu.: -1.00
#> apr : 293 Max. :3025 Max. :50.000 Max. :871.00
#> (Other): 571
#> previous poutcome y
#> Min. : 0.0000 failure: 490 no :4000
#> 1st Qu.: 0.0000 other : 197 yes: 521
#> Median : 0.0000 success: 129
#> Mean : 0.5426 unknown:3705
#> 3rd Qu.: 0.0000
#> Max. :25.0000
#>
summary(bank_full)
#> age job marital education
#> Min. :18.00 Length:45211 Length:45211 Length:45211
#> 1st Qu.:33.00 Class :character Class :character Class :character
#> Median :39.00 Mode :character Mode :character Mode :character
#> Mean :40.94
#> 3rd Qu.:48.00
#> Max. :95.00
#> default balance housing loan
#> Length:45211 Min. : -8019 Length:45211 Length:45211
#> Class :character 1st Qu.: 72 Class :character Class :character
#> Mode :character Median : 448 Mode :character Mode :character
#> Mean : 1362
#> 3rd Qu.: 1428
#> Max. :102127
#> contact day month duration
#> Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
#> Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
#> Mode :character Median :16.00 Mode :character Median : 180.0
#> Mean :15.81 Mean : 258.2
#> 3rd Qu.:21.00 3rd Qu.: 319.0
#> Max. :31.00 Max. :4918.0
#> campaign pdays previous poutcome
#> Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
#> 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
#> Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
#> Mean : 2.764 Mean : 40.2 Mean : 0.5803
#> 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
#> Max. :63.000 Max. :871.0 Max. :275.0000
#> y
#> Length:45211
#> Class :character
#> Mode :character
#>
#>
#>
Comparing the two summaries, the numeric statistics change very little between bank (about 10% of the data) and bank_full (100% of the data): the median age is 39 in both, and the median balance is 444 versus 448. Only the numeric columns can be compared this way, because the categorical columns of bank_full were left as character. Even so, this suggests that bank is representative of bank_full, so I do not need to repeat the EDA for bank_full. From this point onward, I will work only on the bank data.
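Comparing two long summary() printouts by eye is easy to get wrong, so the quartiles can also be put side by side in one table. The sketch below is only an illustration: compare_quartiles is a hypothetical helper, and the two data frames are synthetic stand-ins for bank_full and a 10% sample of it.

```r
# Hypothetical helper: quartiles of every numeric column shared by two
# data frames, placed side by side
compare_quartiles <- function(sample_df, full_df) {
  num_cols <- names(sample_df)[vapply(sample_df, is.numeric, logical(1))]
  num_cols <- intersect(num_cols, names(full_df))
  probs <- c(0.25, 0.5, 0.75)
  rows <- lapply(num_cols, function(col) {
    data.frame(
      column   = col,
      quantile = probs,
      sample   = unname(quantile(sample_df[[col]], probs)),
      full     = unname(quantile(full_df[[col]], probs))
    )
  })
  do.call(rbind, rows)
}

# Synthetic data standing in for bank_full and a 10% random sample of it
set.seed(1)
full_toy <- data.frame(
  age     = round(rnorm(5000, mean = 41, sd = 10)),
  balance = round(rexp(5000, rate = 1 / 1400))
)
sample_toy <- full_toy[sample(nrow(full_toy), 500), ]

comparison <- compare_quartiles(sample_toy, full_toy)
print(comparison)
```

On the real data the call would be compare_quartiles(bank, bank_full). Columns where the sample and full values sit close together support treating the smaller file as representative.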
Next, using boxplots, I want to see whether there are any outliers in the numeric data and what the interquartile range looks like for each column.
boxplot(bank$age, main = "age")
boxplot(bank$balance, main = "balance")
boxplot(bank$day, main = "day of last contact")
boxplot(bank$duration, main = "duration of contact in seconds")
boxplot(bank$campaign, main = "number of times a client has been contacted for this campaign")
boxplot(bank$pdays, main = "number of days since a client was last contacted for a previous campaign", sub = "-1 means never contacted")
boxplot(bank$previous, main = "number of times a client has been contacted for previous campaigns")
From these boxplots I can see that:
# This chunk demonstrates how to display boxplots for all numeric columns of a dataset at once, with each numeric column scaled using its z-score.
# bank2 <- bank
# ind <- sapply(bank2, is.numeric)
# bank2[ind] <- lapply(bank2[ind], scale)
# boxplot(x=bank2[,ind])
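The points a boxplot draws beyond its whiskers can also be counted instead of read off the picture. The sketch below uses boxplot.stats() from grDevices, which applies the same 1.5 times IQR rule that boxplot() uses by default. The helper name count_outliers and the toy values are assumptions made for illustration.

```r
# Hypothetical helper: number of points beyond the boxplot whiskers
# (more than 1.5 * IQR away from the box) in every numeric column
count_outliers <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  vapply(df[num_cols],
         function(col) length(boxplot.stats(col)$out),
         integer(1))
}

# Made-up values; 5000 lies far above the rest of the balance column
toy <- data.frame(
  age     = c(30, 33, 35, 30, 45, 35, 36, 39, 41, 43),
  balance = c(100, 200, 150, 120, 180, 160, 140, 130, 170, 5000)
)

outlier_counts <- count_outliers(toy)
print(outlier_counts)
```

Calling count_outliers(bank) would give one count per numeric column, which makes it easier to say which columns have long tails.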
What about the proportion of customers at each level of each factor column?
plot(bank$job, main="job", las=2)
Three job categories account for a large share of the bank's customers: management, blue-collar, and technician. Management is the largest of the three, followed by blue-collar and then technician.
plot(bank$education, main="education")
The largest group of customers has secondary education (high school).
plot(bank$default, main = "customers who defaulted on credit")
Most customers in the dataset have never defaulted on credit payment.
plot(bank$loan, main = "having a loan besides housing")
plot(bank$contact, main = "contact method")
The telemarketing campaign’s most used method was through cellular telephone.
plot(bank$marital, main = "marital status", sub = "divorced includes widowed status")
Most of the customers are married, and the divorced or widowed group is the smallest.
plot(bank$housing, main = "customers with housing loan")
What is interesting is that the gap between customers with and without a housing loan is relatively small: 2,559 have one and 1,962 do not, a difference of about 600 people. Still, more customers have a housing loan than do not.
plot(bank$month, main = "month of last contact")
The highest number of customers were last contacted in May. I assume this is for the last campaign.
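Because as.factor() sorts levels alphabetically (the str() output lists them as "apr", "aug", "dec", ...), the bars in the month plot appear in alphabetical rather than calendar order. A minimal sketch of one way to reorder them, using R's built-in month.abb constant; the toy vector stands in for bank$month and is made up:

```r
# Month abbreviations in calendar order, lower case as in the data
month_levels <- tolower(month.abb)

# Made-up values standing in for bank$month
month_toy <- factor(c("oct", "may", "apr", "jun", "may", "feb"))
print(levels(month_toy))  # alphabetical order

# Rebuild the factor with calendar-ordered levels; the values are unchanged
month_ordered <- factor(as.character(month_toy), levels = month_levels)
month_counts <- table(month_ordered)
print(month_counts)
```

Plotting the reordered factor, for example plot(month_ordered, main = "month of last contact"), would then draw the bars from jan to dec.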
plot(bank$poutcome, main = "outcome of previous campaign")
In the previous campaign, customers who subscribed to the term deposit (success) form the smallest group, while the largest group by far is unknown.
plot(bank$y, main = "yes or no to the term deposit")
The number of customers who rejected the term deposit in the current campaign is far larger than the number who accepted: only 521 of the 4,521 customers (about 11.5%) accepted the offer.
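The bar heights can be turned into proportions with prop.table(). The sketch below types in the counts for y from the summary(bank) output above so that it runs without the data file:

```r
# Counts of the target y, taken from the summary(bank) output
y_counts <- c(no = 4000, yes = 521)

# Share of each answer among all 4521 customers
y_share <- prop.table(y_counts)
print(round(y_share, 3))
```

With the data frame loaded, prop.table(table(bank$y)) gives the same shares, and the same pattern applies to any of the other factor columns.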
# check the distribution of bank balances of the customers
hist(x = bank$balance,breaks = 8, main = "Distribution of Bank Balance",xlab= "bank balance", ylab= "number of people")
The distribution is heavily right-skewed. Since the third quartile of balance is 1,480 euros, at least three quarters of the 4,521 customers have a balance of 10,000 euros or less. Only a small group has a negative balance, and the number of customers with a balance above 20,000 euros is very small.
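Reading exact counts off histogram bars is error-prone, so it can help to ask hist() for the bin counts directly with plot = FALSE. The sketch below uses a short made-up vector in place of bank$balance; the variable names are illustrative.

```r
# Made-up balances standing in for bank$balance
balance_toy <- c(-88, 0, 147, 221, 307, 747, 1350, 1476,
                 1787, 4789, 12500, 71188)

# Same call as the histogram, but returning the bins instead of drawing them
h <- hist(balance_toy, breaks = 8, plot = FALSE)
bin_table <- data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  count = h$counts
)
print(bin_table)

# Direct counts for the thresholds discussed in the text
n_negative  <- sum(balance_toy < 0)
n_above_20k <- sum(balance_toy > 20000)
print(c(negative = n_negative, above_20k = n_above_20k))
```

Replacing balance_toy with bank$balance would give the exact number of customers in each bin, as well as the exact counts of negative balances and of balances above 20,000 euros.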
What I can conclude from this initial EDA is that: