Explanatory Data for Marketing of Deposit Products in Portuguese banking
1. Overview Data :“Bank-Additional-Full.csv”
Data “bank-additional-full.csv” is a deposit product marketing data which is based on the phone calls of each client at a Portuguese banking institution. Data “bank-additional-full.csv” is data on the marketing of deposit products based on phone calls of each client at a Portuguese banking institution. The data consists of 20 input variables containing the client’s identity. This data is used to predict which clients will buy deposit products when the bank calls them.
2. Data Inspection
Data inspection helps us determine that data import has been executed correctly, that variables are in same length (rows) and breadth (columns) and that variables (columns) are in the same format as expected. In the R program, there are several basic functions that are often used for data inspection, namely read.csv(), head(),tail(),dim(), names().
a.Use read.csv() for input and read data
That must be considered in inputting data is that “data.csv” (data used) is in the same folder as the R project.
bank.data<-read.csv2("bank-additional-full.csv")b. Use head() and tail()
head(bank.data) age job marital education default housing loan contact month
1 56 housemaid married basic.4y no no no telephone may
2 57 services married high.school unknown no no telephone may
3 37 services married high.school no yes no telephone may
day_of_week duration campaign pdays previous poutcome emp.var.rate
1 mon 261 1 999 0 nonexistent 1.1
2 mon 149 1 999 0 nonexistent 1.1
3 mon 226 1 999 0 nonexistent 1.1
cons.price.idx cons.conf.idx euribor3m nr.employed y
1 93.994 -36.4 4.857 5191 no
2 93.994 -36.4 4.857 5191 no
3 93.994 -36.4 4.857 5191 no
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
tail(bank.data) age job marital education default housing loan contact
41183 29 unemployed single basic.4y no yes no cellular
41184 73 retired married professional.course no yes no cellular
41185 46 blue-collar married professional.course no no no cellular
month day_of_week duration campaign pdays previous poutcome
41183 nov fri 112 1 9 1 success
41184 nov fri 334 1 999 0 nonexistent
41185 nov fri 383 1 999 0 nonexistent
emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
41183 -1.1 94.767 -50.8 1.028 4963.6 no
41184 -1.1 94.767 -50.8 1.028 4963.6 yes
41185 -1.1 94.767 -50.8 1.028 4963.6 no
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
c. Use dim() and names()
dim(bank.data)[1] 41188 21
names(bank.data) [1] "age" "job" "marital" "education"
[5] "default" "housing" "loan" "contact"
[9] "month" "day_of_week" "duration" "campaign"
[13] "pdays" "previous" "poutcome" "emp.var.rate"
[17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
[21] "y"
3. Data Cleansing
Check data type for each column
str(bank.data)'data.frame': 41188 obs. of 21 variables:
$ age : int 56 57 37 40 56 45 59 41 24 25 ...
$ job : chr "housemaid" "services" "services" "admin." ...
$ marital : chr "married" "married" "married" "married" ...
$ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
$ default : chr "no" "unknown" "no" "no" ...
$ housing : chr "no" "no" "yes" "no" ...
$ loan : chr "no" "no" "no" "no" ...
$ contact : chr "telephone" "telephone" "telephone" "telephone" ...
$ month : chr "may" "may" "may" "may" ...
$ day_of_week : chr "mon" "mon" "mon" "mon" ...
$ duration : int 261 149 226 151 307 198 139 217 380 50 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
$ emp.var.rate : chr "1.1" "1.1" "1.1" "1.1" ...
$ cons.price.idx: chr "93.994" "93.994" "93.994" "93.994" ...
$ cons.conf.idx : chr "-36.4" "-36.4" "-36.4" "-36.4" ...
$ euribor3m : chr "4.857" "4.857" "4.857" "4.857" ...
$ nr.employed : chr "5191" "5191" "5191" "5191" ...
$ y : chr "no" "no" "no" "no" ...
There are some data types that are still wrong, because we have to change them
bank.data$job<-as.factor(bank.data$job)
bank.data$marital<-as.factor(bank.data$marital)
bank.data$education<-as.factor(bank.data$education)
bank.data$default<-as.factor(bank.data$default)
bank.data$housing<-as.factor(bank.data$housing)
bank.data$loan<-as.factor(bank.data$loan)
bank.data$month<-as.factor(bank.data$month)
bank.data$day_of_week<-as.factor(bank.data$day_of_week)
bank.data$poutcome<-as.factor(bank.data$poutcome)
bank.data$emp.var.rate<-as.numeric(bank.data$emp.var.rate)
bank.data$cons.price.idx<-as.numeric(bank.data$cons.price.idx)
bank.data$cons.conf.idx<-as.numeric(bank.data$cons.conf.idx)
bank.data$euribor3m<-as.numeric(bank.data$euribor3m)
bank.data$nr.employed<-as.numeric(bank.data$nr.employed)
bank.data$duration<-as.integer(bank.data$duration)
str(bank.data)'data.frame': 41188 obs. of 21 variables:
$ age : int 56 57 37 40 56 45 59 41 24 25 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
$ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
$ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
$ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
$ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
$ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
$ contact : chr "telephone" "telephone" "telephone" "telephone" ...
$ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
$ duration : int 261 149 226 151 307 198 139 217 380 50 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
$ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
$ cons.price.idx: num 94 94 94 94 94 ...
$ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
$ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
$ nr.employed : num 5191 5191 5191 5191 5191 ...
$ y : chr "no" "no" "no" "no" ...
checking for missing value
colSums(is.na(bank.data)) age job marital education default
0 0 0 0 0
housing loan contact month day_of_week
0 0 0 0 0
duration campaign pdays previous poutcome
0 0 0 0 0
emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
0 0 0 0 0
y
0
anyNA(bank.data)[1] FALSE
4. Data Explanation
summary(bank.data) age job marital
Min. :17.00 admin. :10422 divorced: 4612
1st Qu.:32.00 blue-collar: 9254 married :24928
Median :38.00 technician : 6743 single :11568
education default housing loan
university.degree :12168 no :32588 no :18622 no :33950
high.school : 9515 unknown: 8597 unknown: 990 unknown: 990
basic.9y : 6045 yes : 3 yes :21576 yes : 6248
contact month day_of_week duration
Length:41188 may :13769 fri:7827 Min. : 0.0
Class :character jul : 7174 mon:8514 1st Qu.: 102.0
Mode :character aug : 6178 thu:8623 Median : 180.0
campaign pdays previous poutcome
Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35563
Median : 2.000 Median :999.0 Median :0.000 success : 1373
emp.var.rate cons.price.idx cons.conf.idx euribor3m
Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
nr.employed y
Min. :4964 Length:41188
1st Qu.:5099 Class :character
Median :5191 Mode :character
[ reached getOption("max.print") -- omitted 4 rows ]
Insight from summary:
- Minimum age of the client is 17 years.
- Most of the clients have jobs as admin.
- Most of the clients had recent education at university degree.
- There are three clients who have credit defaults.
- Most of the clients have housing loan.
- Most of the clients do not have personal loans.
- Month / day client’s last contact is often in May / Thursday.
- The median duration of client telephone contact was three minutes.
- The number of contacts that were called during the marketing of 2000 contacts
- The result of the previous marketing was nonexsiten
Check the Outlier within duration
aggregate(duration~poutcome,bank.data,mean) poutcome duration
1 failure 248.1129
2 nonexistent 257.0822
3 success 320.9417
aggregate(duration~poutcome,bank.data,var) poutcome duration
1 failure 53471.00
2 nonexistent 68778.65
3 success 65403.03
aggregate(duration~poutcome,bank.data,sd) poutcome duration
1 failure 231.2380
2 nonexistent 262.2568
3 success 255.7402
boxplot(bank.data$duration)From the results of the visualization above we find the possibility for outliers, but from our calculation the SD value is around 200 which is still tolerable, so the process can be continued.
5. Bussines Question
- Which is job gives the lowest duration number?
bank.data[bank.data$duration == 0,] age job marital education default housing loan contact
6252 39 admin. married high.school no yes no telephone
23032 59 management married university.degree no yes no cellular
28064 53 blue-collar divorced high.school no yes no cellular
month day_of_week duration campaign pdays previous poutcome
6252 may tue 0 4 999 0 nonexistent
23032 aug tue 0 10 999 0 nonexistent
28064 apr fri 0 3 999 0 nonexistent
emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
6252 1.1 93.994 -36.4 4.857 5191.0 no
23032 1.4 93.444 -36.1 4.965 5228.1 no
28064 -1.8 93.075 -47.1 1.479 5099.1 no
[ reached 'max' / getOption("max.print") -- omitted 1 rows ]
Answer: The client jobs with the lowest call duration are admin, management, and blue collar
2. what jobs he called on that Friday in April?
Answer:
no2<-bank.data[bank.data$month=="apr"& bank.data$day_of_week=="fri",]
(table(no2$job))
admin. blue-collar entrepreneur housemaid management
149 166 26 10 56
retired self-employed services student technician
17 12 62 20 79
unemployed unknown
12 1
In April, July, July, clients who do not have personal loan and job have the highest campaign mean?
Answer:
no3<-bank.data[bank.data$month==c("apr","jun","jul"),]
dt<-aggregate(campaign~job+default,no2,mean)
dt[dt$default=="no",] job default campaign
1 admin. no 1.870504
2 blue-collar no 1.775000
3 entrepreneur no 1.708333
4 housemaid no 1.625000
5 management no 1.688889
6 retired no 1.785714
7 self-employed no 1.750000
8 services no 1.777778
9 student no 2.529412
10 technician no 1.638889
11 unemployed no 1.800000
12 unknown no 2.000000