Welcome to my LBB.

I usebank.csv dataset from https://archive.ics.uci.edu. It contains data on telemarketing from a bank in Portugal. Using this data, we can predict which prospective customers will buy the product when the bank calls them.

More than one contact/campaign to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

READ DATA

First, we read and check the data structures using read.csv and str().

bank <- read.csv2(file = "bank.csv")
str(bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : chr  "unemployed" "services" "management" "management" ...
##  $ marital  : chr  "married" "married" "single" "married" ...
##  $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : chr  "no" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "yes" "no" "yes" ...
##  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr  "oct" "may" "apr" "jun" ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

There are 17 variables:

  • age : age of customers
  • job : type of job
  • marital : marital status
  • education : level of education
  • default: has credit in default?
  • balance : customers’ balance
  • housing: has housing loan?
  • loan: has personal loan?
  • contact: contact communication type
  • day_of_week: last contact day of the week
  • month: last contact month of year
  • duration: last contact duration, in seconds. Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  • campaign: number of contacts performed during this campaign and for this client
  • pdays: number of days that passed by after the client was last contacted from a previous campaign
  • previous: number of contacts performed before this campaign and for this client
  • poutcome: outcome of the previous marketing campaign
  • y: Output variable: has the client subscribed a term deposit?

DATA PREPARATION

Coercion

We have to change some data types.

bank[,c("job", "marital", "education")] <- lapply(bank[,c("job", "marital", "education")], as.factor)
bank[,c("age", "balance", "duration")] <- lapply(bank[,c("age", "balance", "duration")], as.numeric)

str(bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : num  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : num  1787 4789 1350 1476 0 ...
##  $ housing  : chr  "no" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "yes" "no" "yes" ...
##  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr  "oct" "may" "apr" "jun" ...
##  $ duration : num  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

Check Missing Values

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0
anyNA(bank)
## [1] FALSE

There are no missing values in this dataframe.

EXPLORATORY

Let us begin to explore the data.

Question 1: What are the top 3 of job the customers have?

bjob <- as.data.frame(prop.table(table(bank$job)))

bjob[order(bjob$Freq, decreasing = T),]

Answer1: management, blue-collar and technician are the top 3 of job type

Question2: Are those top 3 also be the mostjob type who subscribed for term deposit product?

bank_yes <- bank[bank$y == "yes",]

yesbjob <- as.data.frame(prop.table(table(bank_yes$job)))

yesbjob[order(yesbjob$Freq, decreasing = T),]

**Answer2 :management,technicianand blue-collar are also the top 3 of job types who subscribed term deposit, although in different order.

Question 3: What type of education who subscribed term deposit the most?

yesedu <- as.data.frame(prop.table(table(bank_yes$education)))

yesedu[order(yesedu$Freq, decreasing = T),]

Answer 3: Customers with secondary education buy the product the most.

Question 4: What about job dan education variables together, who subscribed term deposit the most?

#change column 'y' from binary to numeric

bank_yes$y <- sapply(X = as.character(bank_yes$y),
       FUN = switch,
       "yes" = "1")
          

#add column ynum

bank_yes$ynum <- as.numeric(bank_yes$y)

head(bank_yes)
#Find aggregate of `job`+`education` which have the most `ynum`

bank_yes_agg1 <- aggregate(ynum ~ job + education,
                          data = bank_yes,
                          FUN = sum)

bank_yes_agg1[order(bank_yes_agg1$ynum, decreasing = T),]

Answer 4: Customers with “management” job type and have “tertiary” education subscribed term deposit the most.

Question 5: We want to know the distribution of customer’s age who subscribed term deposit.

Step 1: Let us see the summary of age column.

summary(bank_yes$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   32.00   40.00   42.49   50.00   87.00

Step 2: Use boxplot

boxplot(x=bank_yes$age)

Answer 5: From the boxplot, we can see median of customers’ age who buy the product is at 40 y.o. There are outliers at age >78 y.o.

Question 6: How many customers who have balance > 200USD subscribed term deposit?

bank200 <- bank[bank$balance > 200,]
bank200[bank200$y=="yes",]

Answer 6: There are 381 customers who has balance >200 USD subscribed term deposit.

Question 7: We want to know the average of contacts of the previous campaign (previous) which have “success” outcome (poutcome) compared with last campaign.

outcome <- bank[bank$poutcome=="success",]
mean(outcome$previous)
## [1] 3.015504
outcome1 <- bank[bank$y=="yes",]
mean(outcome1$campaign)
## [1] 2.266795

Conclusion:

    • When the bank wants to offer term deposit product, consider that customers who have job as management, tertiary education level, and age between 30-50 y.o. should be called first.
    • Number of contacts performed should be consider 3 times to make the campaign successful.