LBB Programming for Data Science

Welcome to my LBB.

I usebank.csv dataset from https://archive.ics.uci.edu. It contains data on telemarketing from a bank in Portugal. Using this data, we can predict which prospective customers will buy the product when the bank calls them.

More than one contact/campaign to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

READ DATA

First, we read and check the data structures using read.csv and str().

bank <- read.csv2(file = "bank.csv")
str(bank)

## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : chr  "unemployed" "services" "management" "management" ...
##  $ marital  : chr  "married" "married" "single" "married" ...
##  $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : chr  "no" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "yes" "no" "yes" ...
##  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr  "oct" "may" "apr" "jun" ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

There are 17 variables:

age : age of customers
job : type of job
marital : marital status
education : level of education
default: has credit in default?
balance : customers’ balance
housing: has housing loan?
loan: has personal loan?
contact: contact communication type
day_of_week: last contact day of the week
month: last contact month of year
duration: last contact duration, in seconds. Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
campaign: number of contacts performed during this campaign and for this client
pdays: number of days that passed by after the client was last contacted from a previous campaign
previous: number of contacts performed before this campaign and for this client
poutcome: outcome of the previous marketing campaign
y: Output variable: has the client subscribed a term deposit?

DATA PREPARATION

Coercion

We have to change some data types.

bank[,c("job", "marital", "education")] <- lapply(bank[,c("job", "marital", "education")], as.factor)
bank[,c("age", "balance", "duration")] <- lapply(bank[,c("age", "balance", "duration")], as.numeric)

str(bank)

## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : num  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : num  1787 4789 1350 1476 0 ...
##  $ housing  : chr  "no" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "yes" "no" "yes" ...
##  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr  "oct" "may" "apr" "jun" ...
##  $ duration : num  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

Check Missing Values

colSums(is.na(bank))

##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

anyNA(bank)

## [1] FALSE

There are no missing values in this dataframe.

EXPLORATORY

Let us begin to explore the data.

Question 1: What are the top 3 of `job` the customers have?

bjob <- as.data.frame(prop.table(table(bank$job)))

bjob[order(bjob$Freq, decreasing = T),]

Answer1: management, blue-collar and technician are the top 3 of job type

Question2: Are those top 3 also be the most`job` type who subscribed for term deposit product?

bank_yes <- bank[bank$y == "yes",]

yesbjob <- as.data.frame(prop.table(table(bank_yes$job)))

yesbjob[order(yesbjob$Freq, decreasing = T),]

**Answer2 :management,technicianand blue-collar are also the top 3 of job types who subscribed term deposit, although in different order.

Question 3: What type of `education` who subscribed term deposit the most?

yesedu <- as.data.frame(prop.table(table(bank_yes$education)))

yesedu[order(yesedu$Freq, decreasing = T),]

Answer 3: Customers with secondary education buy the product the most.

Question 4: What about `job` dan `education` variables together, who subscribed term deposit the most?

#change column 'y' from binary to numeric

bank_yes$y <- sapply(X = as.character(bank_yes$y),
       FUN = switch,
       "yes" = "1")
          

#add column ynum

bank_yes$ynum <- as.numeric(bank_yes$y)

head(bank_yes)

#Find aggregate of `job`+`education` which have the most `ynum`

bank_yes_agg1 <- aggregate(ynum ~ job + education,
                          data = bank_yes,
                          FUN = sum)

bank_yes_agg1[order(bank_yes_agg1$ynum, decreasing = T),]

Answer 4: Customers with “management” job type and have “tertiary” education subscribed term deposit the most.

Question 5: We want to know the distribution of customer’s age who subscribed term deposit.

Step 1: Let us see the summary of age column.

summary(bank_yes$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   32.00   40.00   42.49   50.00   87.00

Step 2: Use boxplot

boxplot(x=bank_yes$age)

Answer 5: From the boxplot, we can see median of customers’ age who buy the product is at 40 y.o. There are outliers at age >78 y.o.

Question 6: How many customers who have `balance` > 200USD subscribed term deposit?

bank200 <- bank[bank$balance > 200,]
bank200[bank200$y=="yes",]

Answer 6: There are 381 customers who has balance >200 USD subscribed term deposit.

Question 7: We want to know the average of contacts of the previous campaign (`previous`) which have “success” outcome (`poutcome`) compared with last campaign.

outcome <- bank[bank$poutcome=="success",]
mean(outcome$previous)

## [1] 3.015504

outcome1 <- bank[bank$y=="yes",]
mean(outcome1$campaign)

## [1] 2.266795

Conclusion:

- When the bank wants to offer term deposit product, consider that customers who have job as management, tertiary education level, and age between 30-50 y.o. should be called first.
- Number of contacts performed should be consider 3 times to make the campaign successful.

LBB Programming for Data Science

Bidhari Andana

15 Oktober, 2022

READ DATA

DATA PREPARATION

Coercion

Check Missing Values

EXPLORATORY

Question 1: What are the top 3 of `job` the customers have?

Question2: Are those top 3 also be the most`job` type who subscribed for term deposit product?

Question 3: What type of `education` who subscribed term deposit the most?

Question 4: What about `job` dan `education` variables together, who subscribed term deposit the most?

Question 5: We want to know the distribution of customer’s age who subscribed term deposit.

Question 6: How many customers who have `balance` > 200USD subscribed term deposit?

Question 7: We want to know the average of contacts of the previous campaign (`previous`) which have “success” outcome (`poutcome`) compared with last campaign.

Conclusion:

LBB Programming for Data Science

Bidhari Andana

15 Oktober, 2022

READ DATA

DATA PREPARATION

Coercion

Check Missing Values

EXPLORATORY

Question 1: What are the top 3 of job the customers have?

Question2: Are those top 3 also be the mostjob type who subscribed for term deposit product?

Question 3: What type of education who subscribed term deposit the most?

Question 4: What about job dan education variables together, who subscribed term deposit the most?

Question 5: We want to know the distribution of customer’s age who subscribed term deposit.

Question 6: How many customers who have balance > 200USD subscribed term deposit?

Question 7: We want to know the average of contacts of the previous campaign (previous) which have “success” outcome (poutcome) compared with last campaign.

Conclusion:

Question 1: What are the top 3 of `job` the customers have?

Question2: Are those top 3 also be the most`job` type who subscribed for term deposit product?

Question 3: What type of `education` who subscribed term deposit the most?

Question 4: What about `job` dan `education` variables together, who subscribed term deposit the most?

Question 6: How many customers who have `balance` > 200USD subscribed term deposit?

Question 7: We want to know the average of contacts of the previous campaign (`previous`) which have “success” outcome (`poutcome`) compared with last campaign.