Introduction

Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. Understanding the data with basic statistics will give us a tremendous advantage in making the best business decision. The data is downloaded from https://archive.ics.uci.edu/ml/datasets/bank+marketing

Reading The Data

bank <- read.csv(file ="banknew.csv")
rmarkdown::paged_table(bank)

Then check the structure and types of dataset

dim(bank)
## [1] 4521   17
str(bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : chr  "unemployed" "services" "management" "management" ...
##  $ marital  : chr  "married" "married" "single" "married" ...
##  $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : chr  "no" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "yes" "no" "yes" ...
##  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr  "oct" "may" "apr" "jun" ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

The bank dataset has 4521 rows and 17 columns, the name of columns as follow

names(bank)
##  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
##  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
## [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

Data Wrangling

Change the type of data

From the columns above, we can change the types of data, as follow

bank$job <- as.factor(bank$job)
bank$marital<- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$contact <- as.factor(bank$contact)
bank$poutcome <- as.factor(bank$poutcome)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$month <- as.factor(bank$month)

Then we can see, the type has changed

str(bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ y        : chr  "no" "no" "no" "no" ...

The dataset includes information as follow:

  1. Age numeric
  2. Job factor : type of job
  3. Marital factor : status of marriage
  4. Education factor : level of education
  5. Default factor : has credit in default?
  6. Housing factor: has housing loan?
  7. Loan factor: has personal loan?
  8. Contact factor: communication type
  9. Month factor: last contact month of year
  10. Day_of_week factor : last contact day of the week
  11. Duration numeric: last contact duration, in seconds (numeric).
  12. Campaign numeric: number of contacts performed during this campaign and for this client
  13. Pdays numeric: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. Previous numeric: number of contacts performed before this campaign and for this client (numeric)
  15. Poutcome factor: outcome of the previous marketing campaign (categorical)
  16. Balance numeric: average yearly balance
  17. y has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

Calculating the statistical value from the data set

summary(bank)
##       age                 job          marital         education   
##  Min.   :19.00   management :969   divorced: 528   primary  : 678  
##  1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306  
##  Median :39.00   technician :768   single  :1196   tertiary :1350  
##  Mean   :41.17   admin.     :478                   unknown  : 187  
##  3rd Qu.:49.00   services   :417                                   
##  Max.   :87.00   retired    :230                                   
##                  (Other)    :713                                   
##    default             balance      housing     loan           contact    
##  Length:4521        Min.   :-3313   no :1962   no :3830   cellular :2896  
##  Class :character   1st Qu.:   69   yes:2559   yes: 691   telephone: 301  
##  Mode  :character   Median :  444                         unknown  :1324  
##                     Mean   : 1423                                         
##                     3rd Qu.: 1480                                         
##                     Max.   :71188                                         
##                                                                           
##       day            month         duration       campaign     
##  Min.   : 1.00   may    :1398   Min.   :   4   Min.   : 1.000  
##  1st Qu.: 9.00   jul    : 706   1st Qu.: 104   1st Qu.: 1.000  
##  Median :16.00   aug    : 633   Median : 185   Median : 2.000  
##  Mean   :15.92   jun    : 531   Mean   : 264   Mean   : 2.794  
##  3rd Qu.:21.00   nov    : 389   3rd Qu.: 329   3rd Qu.: 3.000  
##  Max.   :31.00   apr    : 293   Max.   :3025   Max.   :50.000  
##                  (Other): 571                                  
##      pdays           previous          poutcome         y            
##  Min.   : -1.00   Min.   : 0.0000   failure: 490   Length:4521       
##  1st Qu.: -1.00   1st Qu.: 0.0000   other  : 197   Class :character  
##  Median : -1.00   Median : 0.0000   success: 129   Mode  :character  
##  Mean   : 39.77   Mean   : 0.5426   unknown:3705                     
##  3rd Qu.: -1.00   3rd Qu.: 0.0000                                    
##  Max.   :871.00   Max.   :25.0000                                    
## 

Check Missing Value

Check the missing value from dataset

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0
anyNA(bank)
## [1] FALSE

There are no missing value from the dataset

Data Agregation

We want to know what kind of customer job that have the maximum balance

bank1 <- aggregate(formula= balance~job, data= bank, FUN= sum)
bank1[order(bank1$balance, decreasing = T),]
##              job balance
## 5     management 1712154
## 2    blue-collar 1026563
## 10    technician 1022205
## 1         admin.  586380
## 6        retired  533414
## 8       services  460350
## 3   entrepreneur  276381
## 7  self-employed  254811
## 4      housemaid  233386
## 11    unemployed  139446
## 9        student  129681
## 12       unknown   57065

The customer who working in management have the highest balance in the bank

Now we want to know the age of customer based on their job

bank$age.Range <-c("18-32", "33-39", "40-47", "48-95")[findInterval(as.numeric(as.character(bank$age)) , c(18, 33, 40, 48, Inf) )]
xtabs(~ bank$age.Range + bank$job, bank)
##               bank$job
## bank$age.Range admin. blue-collar entrepreneur housemaid management retired
##          18-32    126         216           20        12        207       1
##          33-39    136         275           53        19        307       3
##          40-47    114         230           46        22        216       4
##          48-95    102         225           49        59        239     222
##               bank$job
## bank$age.Range self-employed services student technician unemployed unknown
##          18-32            37      136      73        189         35       3
##          33-39            50      110       7        245         24       6
##          40-47            46       84       4        174         30       7
##          48-95            50       87       0        160         39      22

from this graphics we can see the range of age based on the job from customer

graphics::barplot(xtabs(~ age.Range + job, bank), legend.text = TRUE)

From the boxplot below we can see relation between educational level and the age of customer

plot(x =bank$education, y=bank$age)

From the pie chart below we can see the relation between marital status and bank loan, it explain that the married customer have a large portion that divorced and single, that means the married customer have the biggest loan in bank

graphics::pie(xtabs(~ bank$marital, bank$loan))

Then we want to know how the customer make a call to communicate with the bank from the contact they used. There are three types of contact such as cellular, telephone and unknown. The graphics below show us the length of duration between the bank and customer.

ggplot(data = bank, 
       mapping = aes(x = duration,
                     y = reorder(contact, duration))) +
  geom_col(mapping = aes(fill = contact)) +
  # untuk menambahkan label
  
  # untuk custom label pada x,y,title,etc
  labs(x = "Duration in second",
       y = NULL,
       title = "Call duration and types of contact",
       subtitle = NULL,
       fill = "Contact") + # custom legend title untuk fitur warna (fill)
  
  
  theme_economist_white()

The customer with cellular contact have many call duration with the bank

Conclusion

We have a few insights regarding the data. The target customer may come from the age ranging from 33 - 39 and should be called by cellular. Then the married customer maybe will need any loan product from the bank, so if the bank want to make a promotion of their product loan, the primary target can be a customer with marital status married

As we know, the data provide much more insight that has been dug. We have not discovered other variables that may affect a customer’s decision to participate in the campaign. Further analysis should be done to take insight genuinely; for example, we may run a machine learning model for more careful consideration.