Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. Understanding the data with basic statistics will give us a tremendous advantage in making the best business decision. The data is downloaded from https://archive.ics.uci.edu/ml/datasets/bank+marketing
bank <- read.csv(file ="banknew.csv")
rmarkdown::paged_table(bank)
Then check the structure and types of dataset
dim(bank)
## [1] 4521 17
str(bank)
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : chr "unemployed" "services" "management" "management" ...
## $ marital : chr "married" "married" "single" "married" ...
## $ education: chr "primary" "secondary" "tertiary" "tertiary" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : chr "no" "yes" "yes" "yes" ...
## $ loan : chr "no" "yes" "no" "yes" ...
## $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr "oct" "may" "apr" "jun" ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
The bank dataset has 4521 rows and 17 columns, the name of columns as follow
names(bank)
## [1] "age" "job" "marital" "education" "default" "balance"
## [7] "housing" "loan" "contact" "day" "month" "duration"
## [13] "campaign" "pdays" "previous" "poutcome" "y"
From the columns above, we can change the types of data, as follow
bank$job <- as.factor(bank$job)
bank$marital<- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$contact <- as.factor(bank$contact)
bank$poutcome <- as.factor(bank$poutcome)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$month <- as.factor(bank$month)
Then we can see, the type has changed
str(bank)
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ y : chr "no" "no" "no" "no" ...
The dataset includes information as follow:
numericfactor : type of jobfactor : status of marriagefactor : level of educationfactor : has credit in default?factor: has housing loan?factor: has personal loan?factor: communication typefactor: last contact month of yearfactor : last contact day of the weeknumeric: last contact duration, in seconds (numeric).numeric: number of contacts performed during this campaign and for this clientnumeric: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)numeric: number of contacts performed before this campaign and for this client (numeric)factor: outcome of the previous marketing campaign (categorical)numeric: average yearly balancesummary(bank)
## age job marital education
## Min. :19.00 management :969 divorced: 528 primary : 678
## 1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306
## Median :39.00 technician :768 single :1196 tertiary :1350
## Mean :41.17 admin. :478 unknown : 187
## 3rd Qu.:49.00 services :417
## Max. :87.00 retired :230
## (Other) :713
## default balance housing loan contact
## Length:4521 Min. :-3313 no :1962 no :3830 cellular :2896
## Class :character 1st Qu.: 69 yes:2559 yes: 691 telephone: 301
## Mode :character Median : 444 unknown :1324
## Mean : 1423
## 3rd Qu.: 1480
## Max. :71188
##
## day month duration campaign
## Min. : 1.00 may :1398 Min. : 4 Min. : 1.000
## 1st Qu.: 9.00 jul : 706 1st Qu.: 104 1st Qu.: 1.000
## Median :16.00 aug : 633 Median : 185 Median : 2.000
## Mean :15.92 jun : 531 Mean : 264 Mean : 2.794
## 3rd Qu.:21.00 nov : 389 3rd Qu.: 329 3rd Qu.: 3.000
## Max. :31.00 apr : 293 Max. :3025 Max. :50.000
## (Other): 571
## pdays previous poutcome y
## Min. : -1.00 Min. : 0.0000 failure: 490 Length:4521
## 1st Qu.: -1.00 1st Qu.: 0.0000 other : 197 Class :character
## Median : -1.00 Median : 0.0000 success: 129 Mode :character
## Mean : 39.77 Mean : 0.5426 unknown:3705
## 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :871.00 Max. :25.0000
##
Check the missing value from dataset
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
anyNA(bank)
## [1] FALSE
There are no missing value from the dataset
We want to know what kind of customer job that have the maximum balance
bank1 <- aggregate(formula= balance~job, data= bank, FUN= sum)
bank1[order(bank1$balance, decreasing = T),]
## job balance
## 5 management 1712154
## 2 blue-collar 1026563
## 10 technician 1022205
## 1 admin. 586380
## 6 retired 533414
## 8 services 460350
## 3 entrepreneur 276381
## 7 self-employed 254811
## 4 housemaid 233386
## 11 unemployed 139446
## 9 student 129681
## 12 unknown 57065
The customer who working in management have the highest balance in the bank
Now we want to know the age of customer based on their job
bank$age.Range <-c("18-32", "33-39", "40-47", "48-95")[findInterval(as.numeric(as.character(bank$age)) , c(18, 33, 40, 48, Inf) )]
xtabs(~ bank$age.Range + bank$job, bank)
## bank$job
## bank$age.Range admin. blue-collar entrepreneur housemaid management retired
## 18-32 126 216 20 12 207 1
## 33-39 136 275 53 19 307 3
## 40-47 114 230 46 22 216 4
## 48-95 102 225 49 59 239 222
## bank$job
## bank$age.Range self-employed services student technician unemployed unknown
## 18-32 37 136 73 189 35 3
## 33-39 50 110 7 245 24 6
## 40-47 46 84 4 174 30 7
## 48-95 50 87 0 160 39 22
from this graphics we can see the range of age based on the job from customer
graphics::barplot(xtabs(~ age.Range + job, bank), legend.text = TRUE)
From the boxplot below we can see relation between educational level and the age of customer
plot(x =bank$education, y=bank$age)
From the pie chart below we can see the relation between marital status and bank loan, it explain that the married customer have a large portion that divorced and single, that means the married customer have the biggest loan in bank
graphics::pie(xtabs(~ bank$marital, bank$loan))
Then we want to know how the customer make a call to communicate with the bank from the contact they used. There are three types of contact such as cellular, telephone and unknown. The graphics below show us the length of duration between the bank and customer.
ggplot(data = bank,
mapping = aes(x = duration,
y = reorder(contact, duration))) +
geom_col(mapping = aes(fill = contact)) +
# untuk menambahkan label
# untuk custom label pada x,y,title,etc
labs(x = "Duration in second",
y = NULL,
title = "Call duration and types of contact",
subtitle = NULL,
fill = "Contact") + # custom legend title untuk fitur warna (fill)
theme_economist_white()
The customer with cellular contact have many call duration with the bank
We have a few insights regarding the data. The target customer may come from the age ranging from 33 - 39 and should be called by cellular. Then the married customer maybe will need any loan product from the bank, so if the bank want to make a promotion of their product loan, the primary target can be a customer with marital status married
As we know, the data provide much more insight that has been dug. We have not discovered other variables that may affect a customer’s decision to participate in the campaign. Further analysis should be done to take insight genuinely; for example, we may run a machine learning model for more careful consideration.