knitr::include_graphics("bank.jpg")

1. Explanation

The dataset relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y).

2. Dataset Inspection

The first step is to load the dplyr library for data wrangling and read the .csv file.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
bank <- read.csv("bank.csv",sep = ";")
glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age       <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
## $ job       <chr> "unemployed", "services", "management", "management", "blue-…
## $ marital   <chr> "married", "married", "single", "married", "married", "singl…
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary",…
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
## $ balance   <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
## $ housing   <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes",…
## $ loan      <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes…
## $ contact   <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "c…
## $ day       <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
## $ month     <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may…
## $ duration  <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
## $ campaign  <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
## $ pdays     <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
## $ previous  <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
## $ poutcome  <chr> "unknown", "failure", "failure", "unknown", "unknown", "fail…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

This dataset consists of 4,521 observations and 17 variables. Check the first 6 and the last 6 rows:

head(bank)
tail(bank)

3. Cleansing Dataset

To get a clean dataset, we first have to check the type of each variable. For instance, the variable job is stored as character, so we convert it to a factor with as.factor() to get categorical levels.

bank$contact <- as.factor(bank$contact)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$poutcome <- as.factor(bank$poutcome)
bank$month <- as.factor(bank$month)
bank$y <- as.factor(bank$y)
bank$marital <- as.factor(bank$marital)
bank$job <- as.factor(bank$job)
bank$day <- as.factor(bank$day)
bank$campaign <- as.factor(bank$campaign)
bank$age <- as.numeric(bank$age)
glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age       <dbl> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
## $ job       <fct> unemployed, services, management, management, blue-collar, m…
## $ marital   <fct> married, married, single, married, married, single, married,…
## $ education <fct> primary, secondary, tertiary, tertiary, secondary, tertiary,…
## $ default   <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
## $ balance   <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
## $ housing   <fct> no, yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, no…
## $ loan      <fct> no, yes, no, yes, no, no, no, no, no, yes, no, no, no, no, y…
## $ contact   <fct> cellular, cellular, cellular, unknown, unknown, cellular, ce…
## $ day       <fct> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
## $ month     <fct> oct, may, apr, jun, may, feb, may, may, may, apr, may, apr, …
## $ duration  <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
## $ campaign  <fct> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
## $ pdays     <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
## $ previous  <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
## $ poutcome  <fct> unknown, failure, failure, unknown, unknown, failure, other,…
## $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, yes, no,…

After cleansing, the variable types have changed and the dataset is ready to analyze.
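The repeated as.factor() calls above can also be written more compactly. This is only a sketch, assuming dplyr 1.0.0 or later; bank_demo is a small made-up data frame that mimics a few columns of bank.csv, not the real data.

```r
library(dplyr)

# Hypothetical toy data with the same column types as a few columns of bank.csv
bank_demo <- data.frame(
  age      = c(30L, 33L, 35L),
  job      = c("unemployed", "services", "management"),
  day      = c(19L, 11L, 16L),
  campaign = c(1L, 1L, 4L),
  y        = c("no", "no", "yes"),
  stringsAsFactors = FALSE
)

bank_demo <- bank_demo %>%
  mutate(
    across(where(is.character), as.factor),  # every character column becomes a factor
    across(c(day, campaign), as.factor),     # integer columns treated as categories, as above
    age = as.numeric(age)                    # integer to double
  )

str(bank_demo)
```

On the real data the same mutate() call would be applied to bank instead of bank_demo.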

summary(bank)
##       age                 job          marital         education    default   
##  Min.   :19.00   management :969   divorced: 528   primary  : 678   no :4445  
##  1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306   yes:  76  
##  Median :39.00   technician :768   single  :1196   tertiary :1350             
##  Mean   :41.17   admin.     :478                   unknown  : 187             
##  3rd Qu.:49.00   services   :417                                              
##  Max.   :87.00   retired    :230                                              
##                  (Other)    :713                                              
##     balance      housing     loan           contact          day      
##  Min.   :-3313   no :1962   no :3830   cellular :2896   20     : 257  
##  1st Qu.:   69   yes:2559   yes: 691   telephone: 301   18     : 226  
##  Median :  444                         unknown  :1324   19     : 201  
##  Mean   : 1423                                          21     : 198  
##  3rd Qu.: 1480                                          14     : 195  
##  Max.   :71188                                          17     : 191  
##                                                         (Other):3253  
##      month         duration       campaign        pdays       
##  may    :1398   Min.   :   4   1      :1734   Min.   : -1.00  
##  jul    : 706   1st Qu.: 104   2      :1264   1st Qu.: -1.00  
##  aug    : 633   Median : 185   3      : 558   Median : -1.00  
##  jun    : 531   Mean   : 264   4      : 325   Mean   : 39.77  
##  nov    : 389   3rd Qu.: 329   5      : 167   3rd Qu.: -1.00  
##  apr    : 293   Max.   :3025   6      : 155   Max.   :871.00  
##  (Other): 571                  (Other): 318                   
##     previous          poutcome      y       
##  Min.   : 0.0000   failure: 490   no :4000  
##  1st Qu.: 0.0000   other  : 197   yes: 521  
##  Median : 0.0000   success: 129             
##  Mean   : 0.5426   unknown:3705             
##  3rd Qu.: 0.0000                            
##  Max.   :25.0000                            
## 

Summary of Bank Marketing Dataset:

  1. Age ranges from 19 to 87 years

  2. The top 3 occupations are management, blue-collar, and technician
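Both points can also be computed directly instead of being read off the summary() output. A small sketch, where demo is a made-up data frame standing in for bank:

```r
# Hypothetical values standing in for bank$age and bank$job
demo <- data.frame(
  age = c(19, 30, 41, 87, 52, 33, 45, 28),
  job = factor(c("management", "technician", "management", "blue-collar",
                 "blue-collar", "management", "retired", "technician"))
)

# Smallest and largest age
age_range <- range(demo$age)

# Frequency of each job, sorted from most to least common, top 3 kept
top_jobs <- head(sort(table(demo$job), decreasing = TRUE), 3)

print(age_range)
print(top_jobs)
```

On the real data this would be range(bank$age) and head(sort(table(bank$job), decreasing = TRUE), 3).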

Check the histogram of customer ages:

hist(bank$age)

Insight: The histogram above shows that most customers are between 20 and 60 years old, which suggests it is a good move to focus more campaigns on this working-age group.
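To back this reading of the histogram with a number, the share of customers in that age range can be computed. A sketch with made-up ages standing in for bank$age:

```r
# Hypothetical ages standing in for bank$age
age <- c(19, 25, 33, 39, 41, 49, 58, 61, 72, 87)

# TRUE for customers aged 20 to 60 inclusive
in_range <- age >= 20 & age <= 60

# The mean of a logical vector is the proportion of TRUE values
share <- mean(in_range)

# Counts per 10-year bin, without drawing the plot
h <- hist(age, breaks = seq(10, 90, by = 10), plot = FALSE)

print(share)
print(h$counts)
```

On the real data, mean(bank$age >= 20 & bank$age <= 60) gives the corresponding proportion.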

Check whether the dataset has missing values:

anyNA(bank)
## [1] FALSE
colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0
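Note that anyNA() only detects values coded as NA. The summary() output above shows that education, contact and poutcome use the level "unknown" instead, and those are not counted here. A sketch of how such placeholder values might be counted, using a made-up data frame:

```r
# Hypothetical toy data; "unknown" plays the role of a missing value
demo <- data.frame(
  education = c("primary", "unknown", "tertiary", "secondary"),
  contact   = c("cellular", "unknown", "unknown", "telephone"),
  balance   = c(1787, 4789, 1350, 0)
)

# No NA values, so this is FALSE even though some entries are unknown
has_na <- anyNA(demo)

# Number of "unknown" entries per column
unknown_counts <- colSums(demo == "unknown")

print(has_na)
print(unknown_counts)
```

On the real data this would be colSums(bank == "unknown").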

Check the first 10 rows of the dataset for the columns age, marital, and job:

bank[1:10,c("age","marital","job")]

4. Filtering

In my opinion, to find potential targets we have to set several requirements: a balance above 20,000 and contact by cellular phone. Cellular users are likely to be modern people who are comfortable with high-tech tools, so they should be a more promising market. We filter the rows of the dataset accordingly:

bank[bank$balance > 20000 & bank$contact == "cellular",]
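Since dplyr is already loaded, the same row filter can be written with filter(). A sketch on a made-up data frame:

```r
library(dplyr)

# Hypothetical toy data standing in for bank
demo <- data.frame(
  balance = c(1787, 26452, 22546, 307),
  contact = c("cellular", "cellular", "telephone", "unknown")
)

# Conditions separated by commas are combined with AND
targets <- demo %>%
  filter(balance > 20000, contact == "cellular")

print(targets)
```

One difference worth knowing: filter() drops rows where the condition is NA, while base bracket indexing returns them as rows of NA. This dataset has no NA values, so both give the same result here.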

5. Aggregate

avg_balance <- aggregate(balance ~ job, data = bank, FUN = mean)
avg_balance[order(avg_balance$balance, decreasing = TRUE), ]
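aggregate() returns its groups ordered by the grouping variable, not by the computed value. A sketch of another way to get the mean per group sorted from highest to lowest, using tapply() on made-up data:

```r
# Hypothetical toy data standing in for bank
demo <- data.frame(
  job     = c("management", "services", "management", "retired", "services"),
  balance = c(1000, 200, 3000, 5000, 400)
)

# Mean balance per job, returned as a named vector
mean_balance <- tapply(demo$balance, demo$job, mean)

# Sort the means from highest to lowest
mean_sorted <- sort(mean_balance, decreasing = TRUE)

print(mean_sorted)
```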

6. Including Plots

You can also embed plots, for example:

From the scatter plot above, we can see that most people have a balance below 20,000, and only 1 person has a balance above 60,000.
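A scatter plot like the one described here might be drawn as follows. This is only a sketch under the assumption that balance is plotted against row position; the values are made up and stand in for bank$balance.

```r
# Hypothetical balances standing in for bank$balance
balance <- c(1787, 4789, 0, -88, 9374, 26452, 71188, 307)

# One point per customer, in row order
plot(balance, xlab = "Row index", ylab = "Balance")

# Dashed reference line at the 20,000 threshold
abline(h = 20000, lty = 2)

# Counts behind the statement in the text
n_below_20k <- sum(balance < 20000)
n_above_60k <- sum(balance > 60000)

print(c(below_20k = n_below_20k, above_60k = n_above_60k))
```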

plot(xtabs(balance ~ contact+y,bank))

From the xtabs plot above, we can conclude that cellular is the most frequently used contact type compared with the others, and that the number of customers who close (subscribe) is smaller than the number who do not.
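Because balance is on the left-hand side of the formula, xtabs() sums the balances in each cell rather than counting customers. To compare how often each contact type occurs and how many customers subscribe, a count table may be closer to the intent. A sketch on made-up data:

```r
# Hypothetical toy data standing in for bank
demo <- data.frame(
  contact = c("cellular", "cellular", "cellular", "telephone", "unknown", "cellular"),
  y       = c("no", "yes", "no", "no", "no", "no")
)

# An empty left-hand side makes xtabs() count rows per combination
counts <- xtabs(~ contact + y, data = demo)

# Share of "no" and "yes" within each contact type
row_share <- prop.table(counts, margin = 1)

# Mosaic plot of the counts
plot(counts, main = "Contact type by outcome")

print(counts)
print(row_share)
```

On the real data this would be xtabs(~ contact + y, data = bank).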

7. Reference

https://archive.ics.uci.edu/dataset/222/bank+marketing