The data is related with direct marketing campaigns of a Portuguese banking institution.We are trying to analyze clients of a Portuguese banking institution which we classified based on 16 categories. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.
bank <- read.csv("bank/bank-full.csv",sep = ";")
str(bank)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
str(bank)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
head(bank)
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
tail(bank)
## age job marital education default balance housing loan
## 45206 25 technician single secondary no 505 no yes
## 45207 51 technician married tertiary no 825 no no
## 45208 71 retired divorced primary no 1729 no no
## 45209 72 retired married secondary no 5715 no no
## 45210 57 blue-collar married secondary no 668 no no
## 45211 37 entrepreneur married secondary no 2971 no no
## contact day month duration campaign pdays previous poutcome y
## 45206 cellular 17 nov 386 2 -1 0 unknown yes
## 45207 cellular 17 nov 977 3 -1 0 unknown yes
## 45208 cellular 17 nov 456 2 -1 0 unknown yes
## 45209 cellular 17 nov 1127 5 184 3 success yes
## 45210 telephone 17 nov 508 4 -1 0 unknown no
## 45211 cellular 17 nov 361 2 188 11 other no
dim(bank)
## [1] 45211 17
names(bank)
## [1] "age" "job" "marital" "education" "default" "balance"
## [7] "housing" "loan" "contact" "day" "month" "duration"
## [13] "campaign" "pdays" "previous" "poutcome" "y"
From our inspection we can conclude:
-bank data contain 9994 of rows and 15 of columns
Each column name with it’s explanation:
1 - age
2 - job : type of job
3 - marital : marital status
4 - education
5 - default: has credit in default?
6 - balance
7 - housing: has housing loan?
8 - loan: has personal loan?
9 - contact: contact communication type
10 - month: last contact month of year
11 - day_of_week: last contact day of the week
12 - duration: last contact duration, in seconds
13 - campaign: number of contacts performed during this campaign and for this client
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign
15 - previous: number of contacts performed before this campaign and for this client
16 - poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
17 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)
check data type for each column
str(bank)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
From this result, we find some of data type not in the corect type. we need to convert it into corect type (data coertion)
names <- c("job","marital","education","default","housing","loan","contact","month","day","poutcome","y")
bank[,names] <- lapply(bank[,names] , as.factor)
str(bank)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : Factor w/ 31 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Each of column had already changed into each own desired data type
check for missing values
anyNA(bank)
## [1] FALSE
Great!! No missing value
Brief explanation
summary(bank)
## age job marital education
## Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management :9458 married :27214 secondary:23202
## Median :39.00 technician :7597 single :12790 tertiary :13301
## Mean :40.94 admin. :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## 20 : 2752 may :13766 Min. : 0.0 Min. : 1.000
## 18 : 2308 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## 21 : 2026 aug : 6247 Median : 180.0 Median : 2.000
## 17 : 1939 jun : 5341 Mean : 258.2 Mean : 2.764
## 6 : 1932 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## 5 : 1910 apr : 2932 Max. :4918.0 Max. :63.000
## (Other):32344 (Other): 6060
## pdays previous poutcome y
## Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
## Median : -1.0 Median : 0.0000 success: 1511
## Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
##
summary(bank$job)
## admin. blue-collar entrepreneur housemaid management
## 5171 9732 1487 1240 9458
## retired self-employed services student technician
## 2264 1579 4154 938 7597
## unemployed unknown
## 1303 288
summary(bank$education)
## primary secondary tertiary unknown
## 6851 23202 13301 1857
prop.table(table(bank$month)) * 100
##
## apr aug dec feb jan jul jun
## 6.4851474 13.8174338 0.4733361 5.8591936 3.1032271 15.2507133 11.8134967
## mar may nov oct sep
## 1.0550530 30.4483422 8.7810489 1.6323461 1.2806618
Summary:
1.The minimum age of the clients was 18 meanwhile the maximum age was 95
2.Most of the client’s jobs came from blue-collar and management field
3.around 51 percent of the clients were secondary education-background
4.44396 from 45211 clients had no credit in default but there were 7244 clients who had personal loans
5.approximately 30 percent of the clients were called on may
6.roughly there were more than 75 percent of the clients had never been contacted before
7.From the data we can see that only 5289 from 45211 decided to subscribe the term deposit
#consider only the data which the clients subscribe
bank_yes <- bank[bank$y == "yes" , ]
1.What were the range of age did subscribe deposit the most
temp <- prop.table(table(bank_yes$age))*100
plot(temp,type = "o")
The plot is interesting because it looks like a hill.
As we can see from the plot that from 18 until 29 years old, the amount of clients were increasing and then the peak was from 30 to 35 years old,but then the trend was decreasing afterwards
table(bank_yes$job)
##
## admin. blue-collar entrepreneur housemaid management
## 631 708 123 109 1301
## retired self-employed services student technician
## 516 187 369 269 840
## unemployed unknown
## 202 34
Most subscribers came from management field which is 1301 clients and the least subscribers came from the unknown job which contributes 34 clients
3.From each educational status what were the percentage of them subscribed the term deposit
prop.table(table(bank$education,bank$y),margin = 1)
##
## no yes
## primary 0.91373522 0.08626478
## secondary 0.89440565 0.10559435
## tertiary 0.84993610 0.15006390
## unknown 0.86429725 0.13570275
From the prop table we could see that from each educational status, the amount that deposit was small, only from range 8% (primary sector) to 15%(tertiary sector)
4.Balance information of clients who subscribed the deposit
nrow(bank_yes)
## [1] 5289
summary(bank_yes$balance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3058 210 733 1804 2159 81204
boxplot(bank_yes$balance)
From 5289 data,the median was 752$ and the mean is 1828$ while the max balance was 81204$, the boxplot could tell us that there were many high outliers
5.What job had the most and least personal loan
temp<-bank
temp$loan <- sapply(X = as.character(temp$loan),FUN = switch,"no" = "0","yes" = "1")
temp$loan <- as.integer(temp$loan)
agg <- aggregate.data.frame(list(has_personal_loan = temp$loan),by = list(job = temp$job),FUN = sum)
agg <- agg[order(agg$has_personal_loan,decreasing = T),]
agg
## job has_personal_loan
## 2 blue-collar 1684
## 10 technician 1309
## 5 management 1253
## 1 admin. 991
## 8 services 836
## 3 entrepreneur 356
## 6 retired 309
## 7 self-employed 229
## 4 housemaid 152
## 11 unemployed 109
## 9 student 12
## 12 unknown 4
We could see from the data that the job that had the most personal loan came from blue-collar sector meanwhile the least persoal loan came from unemployed sector
6.Meanwhile what job had the most and least housing loan
temp<-bank
temp$housing <- sapply(X = as.character(temp$housing),FUN = switch,"no" = "0","yes" = "1")
temp$housing <- as.integer(temp$housing)
agg <- aggregate.data.frame(list(has_housing_loan = temp$housing),by = list(job = temp$job),FUN = sum)
agg <- agg[order(agg$has_housing_loan,decreasing = T),]
agg
## job has_housing_loan
## 2 blue-collar 7048
## 5 management 4678
## 10 technician 4115
## 1 admin. 3182
## 8 services 2766
## 3 entrepreneur 869
## 7 self-employed 765
## 11 unemployed 543
## 6 retired 491
## 4 housemaid 398
## 9 student 249
## 12 unknown 26
The job that had the most housing loan was still blue-collar but the least housing loan was housemaid
7.Was there a big correlation between the one who had personal loan and housing loan
cor(x = bank$housing == "yes" ,y = bank$loan == "yes")
## [1] 0.04132287
nrow(bank[bank$housing == "yes" & bank$loan == "yes",])
## [1] 4367
The correlation of the same client had both personal loan and housing loan were so small. There were only 4367 from 45211 clients
8.What is the percentage of the clients who didn’t have housing loan neither personal loan subscribe the deposit over all of the client that subscribe the deposit
x <- nrow(bank_yes[bank_yes$housing == "no" & bank_yes$loan == "no",])
y <- nrow(bank_yes)
x/y * 100
## [1] 59.27396
59% who subscribe the deposit didn’t have housing loan neither personal loan
9.What is the mean and median last call duration before the client subscribe the deposit
durasi <- bank_yes$duration
#because the default is in seconds we convert first into minutes
durasi <- durasi / 60
hist(durasi)
mean(durasi)
## [1] 8.95491
median(durasi)
## [1] 7.1
The mean of last call duration was 8.9 minutes while the median was 7.1 minutes
10.Simple statistics of the number of times did the bank need to call before the client subscribed
kampain <- bank_yes$campaign
fivenum(kampain)
## [1] 1 1 2 3 32
mean(kampain)
## [1] 2.141047
boxplot(kampain)
The mean was 2.14
The minimum and lower quartile was 1,the median was 2 , the upper quartile was 3 and the maximum was 32
The data from the fivenum and boxplot could tell us that there were some high outliers
11.We could see that the mean and the median, both were around at 2 times calling.From all clients that had been called 2 times,what was the percentage that they refused to subscribe the deposit
nrow(bank[bank$campaign == 2 & bank$y == "no",])/nrow(bank[bank$campaign == 2,]) * 100
## [1] 88.79648
The interesting part was 88% of clients that had been called twice refused to subscribe the deposit