EDA - Bank Marketing

0.1 About Data

The data Bank Marketing that we’re going to explore is provided at https://archive.ics.uci.edu/ml/datasets/bank+marketing. The data records marketing campaigns through phone call at one of local bank in Portuguese.

Input

First of all, use read.csv() from baseR to read bank-full.csv that has been saved in local folder. Define it as bank

bank <- read.csv("data_input/bank-full.csv", sep = ";")
head(bank, 3)

Inspect

dim(bank)

## [1] 45211    17

The dimention of bank data set is 45211 rows and 17 columns to be observed.

names(bank)

##  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
##  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
## [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

It has 17 variables which explains:

age : (numeric)
job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
balance: (numeric)
housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
contact: contact communication type (categorical: ‘cellular’,‘telephone’)
day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). (Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.)
campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’) Output variable (desired target):
y : has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

Now let’s inspect the type and some characteristics of each variable or column.

str(bank)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

AnyNA

anyNA(bank)

## [1] FALSE

0.2 Data Transformation

In data preparation process, one of the most essential step to do is transforming some variables type into the right one (as described above). A proper data transformation must be exceeded to make it easier for us on wrangling and visualization process later on.

bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$day <- as.factor(bank$day)
bank$month <- as.factor(bank$month)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)

str(bank)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : Factor w/ 31 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Datetime Variable

If you see variables day and month, they supposed to be declared as datetime right? Why is day known as factor? And so is month? Let’s take a look.

length(unique(bank$day))

## [1] 31

length(unique(bank$month))

## [1] 12

Each of those variables has date information but in two separated column. A quick note for us, that every datetime variable must be concists of at-least-two information of datetime in the same column. For example: there are 3 observations : Jan-05, Mar-31, or 23-06-2014 in a column named date.
Then that date column can be transformed into datetime type variable. For our case, it is impossible to change day and monthinto datetime so we can do some alternative steps. First, create new column as monthdate which extracts date information from those two columns.

bank$monthdate <- as.Date(paste(bank$month,bank$day, sep = "-"), format = "%b-%d")
head(bank)

Voila! Now we have monthdate as the new column that tells us about date information.

0.3 Wrangling & Exploratory Data

After doing data transformation and preparation, we must define some target or problem that we are going to solve as this analyzing progress goes on. This step will lead us into the final result as we want. For this bank data set, we want to analyze the characteristic of target market of this bank, from variables: job, marital, education, age, balance, loan, housing by using table()/aggregate().
Before that, create new column rangeage to categorise each customer into a categorical variable. We can wrangling the data by using function case_when() from library(dplyr)

bank$rangeage <- as.factor(case_when(
  bank$age < 23 ~ "18-22",
  bank$age > 22 & bank$age < 36 ~ "23-35",
  bank$age > 35 & bank$age < 51 ~ "36-50",
  TRUE ~ "above 50"
  ))
head(bank)

sort(round(prop.table(table(bank$job))*100, 2), decreasing = T)

## 
##   blue-collar    management    technician        admin.      services 
##         21.53         20.92         16.80         11.44          9.19 
##       retired self-employed  entrepreneur    unemployed     housemaid 
##          5.01          3.49          3.29          2.88          2.74 
##       student       unknown 
##          2.07          0.64

sort(round(prop.table(table(bank$marital))*100, 2), decreasing = T)

## 
##  married   single divorced 
##    60.19    28.29    11.52

sort(round(prop.table(table(bank$education))*100, 2), decreasing = T)

## 
## secondary  tertiary   primary   unknown 
##     51.32     29.42     15.15      4.11

sort(round(prop.table(table(bank$rangeage))*100, 2), decreasing = T)

## 
##    36-50    23-35 above 50    18-22 
##    42.13    36.72    20.47     0.67

round(prop.table(table(bank$job, bank$marital))*100, 2)

##                
##                 divorced married single
##   admin.            1.66    5.96   3.82
##   blue-collar       1.66   15.41   4.45
##   entrepreneur      0.40    2.37   0.53
##   housemaid         0.41    2.02   0.32
##   management        2.46   11.94   6.52
##   retired           0.94    3.83   0.24
##   self-employed     0.31    2.20   0.99
##   services          1.21    5.32   2.65
##   student           0.01    0.12   1.94
##   technician        2.05    8.96   5.80
##   unemployed        0.38    1.62   0.89
##   unknown           0.04    0.45   0.15

round(prop.table(table(bank$job, bank$education))*100, 2)

##                
##                 primary secondary tertiary unknown
##   admin.           0.46      9.33     1.27    0.38
##   blue-collar      8.31     11.88     0.33    1.00
##   entrepreneur     0.40      1.20     1.52    0.17
##   housemaid        1.39      0.87     0.38    0.10
##   management       0.65      2.48    17.25    0.54
##   retired          1.76      2.18     0.81    0.26
##   self-employed    0.29      1.28     1.84    0.09
##   services         0.76      7.65     0.45    0.33
##   student          0.10      1.12     0.49    0.36
##   technician       0.35     11.57     4.35    0.54
##   unemployed       0.57      1.61     0.64    0.06
##   unknown          0.11      0.16     0.09    0.28

round(prop.table(table(bank$job, bank$loan))*100, 2)

##                
##                    no   yes
##   admin.         9.25  2.19
##   blue-collar   17.80  3.72
##   entrepreneur   2.50  0.79
##   housemaid      2.41  0.34
##   management    18.15  2.77
##   retired        4.32  0.68
##   self-employed  2.99  0.51
##   services       7.34  1.85
##   student        2.05  0.03
##   technician    13.91  2.90
##   unemployed     2.64  0.24
##   unknown        0.63  0.01

round(prop.table(table(bank$job, bank$housing))*100, 2)

##                
##                    no   yes
##   admin.         4.40  7.04
##   blue-collar    5.94 15.59
##   entrepreneur   1.37  1.92
##   housemaid      1.86  0.88
##   management    10.57 10.35
##   retired        3.92  1.09
##   self-employed  1.80  1.69
##   services       3.07  6.12
##   student        1.52  0.55
##   technician     7.70  9.10
##   unemployed     1.68  1.20
##   unknown        0.58  0.06

summary(bank$balance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -8019      72     448    1362    1428  102127

From the summary above, we know that some customers have minus balance value. Define that into new data named minus.

minus <- bank[bank$balance < 0, ]

aggregate(formula = balance ~ job, data = bank, FUN = mean)

aggregate(formula = balance ~ job, data = minus, FUN = mean)

data <- xtabs( ~ poutcome + job, bank)
data

##          job
## poutcome  admin. blue-collar entrepreneur housemaid management retired
##   failure    643        1068          164        97       1044     230
##   other      241         401           39        26        393      83
##   success    204         148           22        29        387     174
##   unknown   4083        8115         1262      1088       7634    1777
##          job
## poutcome  self-employed services student technician unemployed unknown
##   failure           165      444     117        797        112      20
##   other              61      177      81        299         37       2
##   success            55       85      87        245         64      11
##   unknown          1298     3448     653       6256       1090     255

heatmap(data, Rowv = NA, Colv = NA, cexCol = 1, cexRow = 1, scale = "row")

success <- bank[bank$poutcome == "success",]
success$month <- factor(success$month, levels = c("jan", "feb", "mar","apr","may","jun","jul","aug","sep","oct","nov","dec"))
success_xtabs <- xtabs(~month + job, data = success )
success_xtabs

##      job
## month admin. blue-collar entrepreneur housemaid management retired
##   jan      7           5            1         2         10      16
##   feb     18          14            1         3         32      17
##   mar      5           4            0         1         23      11
##   apr     15          16            2         1         35      11
##   may     37          43            3         5         53      18
##   jun     17          10            3         3         35       7
##   jul     17           7            0         1         19      20
##   aug     29          13            4         3         46      20
##   sep     23           9            1         1         48      15
##   oct     13           8            2         4         36      19
##   nov     13          17            4         4         38      14
##   dec     10           2            1         1         12       6
##      job
## month self-employed services student technician unemployed unknown
##   jan             2        5       4          8          3       1
##   feb             6        5      15         23          8       1
##   mar             2        4       4          3          3       0
##   apr             6        6       6         27          3       1
##   may             6       16      10         38          6       1
##   jun             3       11       2         23          3       1
##   jul             5        5       6         12          7       0
##   aug             8       10      15         47         10       2
##   sep             4        5       6         12          7       1
##   oct             6        2      10         24          5       2
##   nov             6        8       7         22          7       1
##   dec             1        8       2          6          2       0

heatmap(success_xtabs, Rowv = NA, Colv = NA, scale = "row")

subscribe <- xtabs(~ job+y, data = bank)
heatmap(x = subscribe, Rowv = NA, Colv = NA, cexCol = 1, scale = "column")

0.4 Data Analysis

By doing data wrangling and exploratory, we can get some information about market behaviour and characteristic of this bank. First, the class of job such as entrepreneur, umployed, housemaid, student, admin, etc has shown blue-collar (21.53%), management (20.92%), and technician(16.80%) on top three. While the marital variable tells that mostly blue-collar and management segment had married. Blue-collar and management have a contrary in education class, which more than 15% of management segment has tertiary education history and blue-collar is less than 0.5%, dominated in secondary class education.

If we see the summary of balance (mean), we’ll notice that actually management segment has the biggest contribution in deficit balance (-) but also in top 5 for the highest amount of balance.

Now, let’s analyze poutcome variable which explains the outcome of previous marketing campaign. It can show probability of target market whether they will subscribe a term deposit or not. According to heatmap, management segment generates the highest rate of success, it also dominated in other categories of poutcome along a year, significantly in March, September, and November. While blue-collar segment doesn’t really show big impact for succesfull rate market. On the other side, technician contributes accountable success rate in August and stable performance in every category of poutcome.

Conclusion:
It is important to strengthen the exposure for target market at the right time, reconsider the capability of each segment whether they’d subscribe a term deposit or not. For example, blue-collar segment has a big number of unsubscribed deposit while management segment looks promising. So, it will be a wise desicion to put this market segment into top priority.