Import Data

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls.

bank <- read.csv("bank-full.csv",sep = ";")

Data Inspection

head(bank, 5)
dim(bank)
## [1] 45211    17

From our inspection we can conclude :

  • bank data contain 45211 rows and 17 columns
  • each columns names: “age”, “job”, “marital”, “education”, “default”, “balance”, “housing”, “loan”
    “contact”, “day”, “month”, “duration”, “campaign”, “pdays”, “previous”, “poutcome”, “y”

Data Structure & NA Checking

str(bank)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

Based on STR, I want to change some columns into factor

bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$month <- as.factor(bank$month)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)

str(bank)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

After that, I want to check if there is any NA value in the data.

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0
anyNA(bank)
## [1] FALSE

Based on the result, the data don’t have any NA values, so we can continue to explore the data

Data Wrangling & Exploration

I want to know more about education degree of our clients

data.frame(sort(table("education" = bank$education),decreasing = T))

I want to know more about loan condition of our clients

data.frame(table("loan" = bank$loan, 
      "housing" = bank$housing,
      "default" = bank$default))

I want to know marital status of our clients

data.frame(sort(table("marital_status" = bank$marital),decreasing = T))

I want to know more about how to contact our clients

data.frame(sort(table("contact" = bank$contact),decreasing = T))

I want to know how many clients have not been contacted

data.frame(sort(table("contacted" = bank$previous != 0),decreasing = T))

Based on the data, we know that:

  • Education degree of our clients mainly secondary and tertiary
  • Half of our clients already have housing loan
  • Majority of our clients already married
  • Majority of our clients can be contacted by cellular
  • Majority of our clients have not been contacted for loan program

Business Case

Despite majority of our clients already have housing loan, many of them don’t

For this reason, I want to create a new program for loan.

But first, I have to explore the data about job and age distribution.

More Data Wrangling & Exploration

To see job description and the distribution

data.frame(sort(table("job" = bank$job),decreasing = T))

To see age distribution

data.frame(sort(table("age" = bank$age),decreasing = T))

Based on the data, I want to know how many clients we can prospect to join the loan program. The requirements:

  • age between 30 and 50
  • work as management or blue-collar or technician or admin. or services
  • don’t have loan, housing loan, and default credit.
  • can be contacted

Condition

Condition for subsetting

age_cond <- bank$age == 30:60
job_cond <- bank$job %in% c("management", "blue-collar", "technician", "admin.", "services")
loan_cond <- bank$loan == "no" & bank$housing == "no" & bank$default == "no"
cont_cond <- bank$contact != "unknown"

Subsetting

Subsetting to get new dataframe

prog_cond <- bank[age_cond & job_cond & loan_cond & cont_cond, ]
prog_cond

Based the data, we have 278 clients to be offered.