knitr::include_graphics("bank.jpg")
The dataset is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y).
The first step is to load the dplyr library for data wrangling and then read the .csv file.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bank <- read.csv("bank.csv",sep = ";")
glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
## $ job <chr> "unemployed", "services", "management", "management", "blue-…
## $ marital <chr> "married", "married", "single", "married", "married", "singl…
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary",…
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
## $ balance <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
## $ housing <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes",…
## $ loan <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "yes…
## $ contact <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "c…
## $ day <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
## $ month <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "may…
## $ duration <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
## $ campaign <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
## $ pdays <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
## $ previous <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
## $ poutcome <chr> "unknown", "failure", "failure", "unknown", "unknown", "fail…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
This dataset consists of 4,521 observations and 17 variables. Check the first and last six rows:
head(bank)
tail(bank)
To get an accurate and clean dataset, we first need to fix the variable types. For instance, the job variable is stored as character, so we convert it to a factor with as.factor() to give it categorical levels.
bank$contact <- as.factor(bank$contact)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$poutcome <- as.factor(bank$poutcome)
bank$month <- as.factor(bank$month)
bank$y <- as.factor(bank$y)
bank$marital <- as.factor(bank$marital)
bank$job <- as.factor(bank$job)
bank$day <- as.factor(bank$day)
bank$campaign <- as.factor(bank$campaign)
bank$age <- as.numeric(bank$age)
glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age <dbl> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
## $ job <fct> unemployed, services, management, management, blue-collar, m…
## $ marital <fct> married, married, single, married, married, single, married,…
## $ education <fct> primary, secondary, tertiary, tertiary, secondary, tertiary,…
## $ default <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
## $ balance <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…
## $ housing <fct> no, yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, no…
## $ loan <fct> no, yes, no, yes, no, no, no, no, no, yes, no, no, no, no, y…
## $ contact <fct> cellular, cellular, cellular, unknown, unknown, cellular, ce…
## $ day <fct> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29,…
## $ month <fct> oct, may, apr, jun, may, feb, may, may, may, apr, may, apr, …
## $ duration <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 32…
## $ campaign <fct> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1, …
## $ pdays <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1,…
## $ previous <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, …
## $ poutcome <fct> unknown, failure, failure, unknown, unknown, failure, other,…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, yes, no,…
After cleaning, the variable types have changed and the dataset is ready to analyze.
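The column-by-column conversion above could also be written more compactly. The following is a minimal sketch in base R, assuming that every character column should become a factor; the `toy` data frame is hypothetical and only stands in for `bank`.

```r
# Hypothetical toy data standing in for the bank data frame
toy <- data.frame(
  age = c(30L, 33L, 35L),
  job = c("unemployed", "services", "management"),
  y = c("no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

Note that this only covers character columns. Integer columns such as day and campaign, which the code above also turns into factors, would still need an explicit as.factor() call.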
summary(bank)
## age job marital education default
## Min. :19.00 management :969 divorced: 528 primary : 678 no :4445
## 1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306 yes: 76
## Median :39.00 technician :768 single :1196 tertiary :1350
## Mean :41.17 admin. :478 unknown : 187
## 3rd Qu.:49.00 services :417
## Max. :87.00 retired :230
## (Other) :713
## balance housing loan contact day
## Min. :-3313 no :1962 no :3830 cellular :2896 20 : 257
## 1st Qu.: 69 yes:2559 yes: 691 telephone: 301 18 : 226
## Median : 444 unknown :1324 19 : 201
## Mean : 1423 21 : 198
## 3rd Qu.: 1480 14 : 195
## Max. :71188 17 : 191
## (Other):3253
## month duration campaign pdays
## may :1398 Min. : 4 1 :1734 Min. : -1.00
## jul : 706 1st Qu.: 104 2 :1264 1st Qu.: -1.00
## aug : 633 Median : 185 3 : 558 Median : -1.00
## jun : 531 Mean : 264 4 : 325 Mean : 39.77
## nov : 389 3rd Qu.: 329 5 : 167 3rd Qu.: -1.00
## apr : 293 Max. :3025 6 : 155 Max. :871.00
## (Other): 571 (Other): 318
## previous poutcome y
## Min. : 0.0000 failure: 490 no :4000
## 1st Qu.: 0.0000 other : 197 yes: 521
## Median : 0.0000 success: 129
## Mean : 0.5426 unknown:3705
## 3rd Qu.: 0.0000
## Max. :25.0000
##
Summary of the Bank Marketing dataset:
Customer age ranges from 19 to 87 years.
The top three occupations are management, blue-collar, and technician.
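The ranking of occupations can also be obtained directly instead of reading it off the summary. This is a small sketch using a hypothetical `jobs` vector; with the real data the same calls would presumably be applied to `bank$job`.

```r
# Hypothetical job labels standing in for bank$job
jobs <- rep(
  c("management", "blue-collar", "technician", "services"),
  times = c(4, 3, 2, 1)
)

# Count each occupation and sort from most to least frequent
job_counts <- sort(table(jobs), decreasing = TRUE)

# Keep the three most frequent occupations
head(job_counts, 3)
```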
Check the histogram of customers by age:
hist(bank$age)
Insight: The histogram above shows that most customers are between 20 and 60 years old, so it makes sense to focus more campaigns on this age range.
Check whether the dataset contains missing values:
anyNA(bank)
## [1] FALSE
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
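Although there are no NA values, the summary output shows an "unknown" level in education, contact, and poutcome, which may act as a stand-in for missing information. The sketch below counts such entries per column; the `toy` data frame and its values are hypothetical and only illustrate the idea.

```r
# Hypothetical toy data with "unknown" used as a placeholder value
toy <- data.frame(
  education = c("primary", "unknown", "tertiary", "unknown"),
  contact = c("cellular", "unknown", "telephone", "cellular"),
  balance = c(1787L, 0L, 1350L, -88L),
  stringsAsFactors = FALSE
)

# Compare every cell with "unknown" and count the matches per column
unknown_counts <- colSums(toy == "unknown")
print(unknown_counts)
```

Whether "unknown" should be treated as a missing value or kept as its own category is a modelling decision that this sketch does not settle.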
Check the first 10 rows of the dataset for age, marital, and job:
bank[1:10,c("age","marital","job")]
In my opinion, to find potential targets we should set a few requirements: a balance above 20,000 and contact by cellular phone, which suggests customers who are comfortable with modern technology and may therefore be a more accurate potential market. We filter the rows of the dataset accordingly:
bank[bank$balance > 20000 & bank$contact == "cellular",]
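The same row filter can also be expressed with base R's subset(), which lets the condition refer to column names directly. This is only a sketch on a hypothetical `toy` data frame with made-up values, not the result for the real data.

```r
# Hypothetical toy data standing in for the bank data frame
toy <- data.frame(
  balance = c(1787, 26452, 22171, 500),
  contact = c("cellular", "cellular", "telephone", "unknown"),
  stringsAsFactors = FALSE
)

# Keep rows with a balance above 20000 that were contacted by cellular phone
high_cell <- subset(toy, balance > 20000 & contact == "cellular")
print(high_cell)
```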
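# Mean balance per job, sorted from highest to lowest
balance_by_job <- aggregate(balance ~ job, data = bank, FUN = mean)
balance_by_job[order(balance_by_job$balance, decreasing = TRUE), ]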
From the scatter plot above, we can see that most customers have a balance below 20,000, and only one customer has a balance above 60,000.
plot(xtabs(balance ~ contact+y,bank))
From the xtabs plot above (which sums balance by contact type and outcome), we can conclude that cellular is the most frequently used contact type compared with the others, and the number of customers who are closing is less than