1. About Dataset
The data relates to the direct marketing campaigns of a Portuguese banking institution. The campaigns were conducted by phone. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to or not.
2. Read Data
Let’s read the data and look at the meaning of each attribute in the dataset:
Load Data
library(dplyr)  # %>% and mutate_at(), used in the cleansing steps
library(DT)     # datatable()
bank <- read.csv('Data Input/Bank Marketing/bank-full.csv', sep = ';')
datatable(bank, rownames = FALSE, filter = "top", options = list(pageLength = 5, scrollX = TRUE))
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
Meta Data
Input variables bank :
- Client Data
- age (numeric)
- job : type of job (categorical):
- “admin.”
- “unknown”
- “unemployed”
- “management”
- “housemaid”
- “entrepreneur”
- “student”
- “blue-collar”
- “self-employed”
- “retired”
- “technician”
- “services”
- marital : marital status (categorical):
- “married”
- “divorced”
- “single”, note: “divorced” means divorced or widowed
- education (categorical):
- “unknown”
- “secondary”
- “primary”
- “tertiary”
- default: has credit in default? (binary: “yes”,“no”)
- balance: average yearly balance, in euros (numeric)
- housing: has housing loan? (binary: “yes”,“no”)
- loan: has personal loan? (binary: “yes”,“no”)
- Related to the last contact of the current campaign:
- contact: contact communication type (categorical):
- “unknown”
- “telephone”
- “cellular”
- day: last contact day of the month (numeric)
- month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
- duration: last contact duration, in seconds (numeric)
- Other attributes:
- campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- previous: number of contacts performed before this campaign and for this client (numeric)
- poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
- y: has the client subscribed to a term deposit? (binary: “yes”,“no”); this is the target variable
3. Data Cleansing
Let’s find out more about our data
3.1. Data Wrangling
# Data structure check
str(bank)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
Several attributes still have the wrong data type: the yes/no attributes and the other categorical attributes are stored as character strings.
Before converting them, let’s check the number of unique values in each column to determine which attributes should become factors.
# Check and count unique value
sapply(bank, function(x) length(unique(x)))
## age job marital education default balance housing loan
## 77 12 3 4 2 7168 2 2
## contact day month duration campaign pdays previous poutcome
## 3 31 12 1573 48 559 41 4
## y
## 2
The character attributes have at most 12 unique values across 45211 observations, so they are categorical and we convert them to factors.
bank <- bank %>%
mutate_at(vars(job, marital, education, default, housing, loan, contact, poutcome, month, y), as.factor)
str(bank)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
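The same conversion can be written in base R by selecting every character column instead of naming each one. This is only a sketch: `toy` is a hypothetical three-row stand-in for `bank`, and its values are illustrative. In `bank`, the character columns are exactly the ten attributes converted above, so both approaches should agree there.

```r
# Hypothetical three-row stand-in for `bank`; values are illustrative only.
toy <- data.frame(
  age     = c(58L, 44L, 33L),
  job     = c("management", "technician", "entrepreneur"),
  housing = c("yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert only those to factors.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

Selecting by type avoids forgetting a column, but it would also convert free-text columns if a dataset had any, so naming the columns explicitly is the safer choice when in doubt.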
3.2. Missing Value check
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
Based on these results, there are no NA values in our data.
However, missing values may be encoded in another way. Let’s count the occurrences of the word “unknown” in each column.
bank %>%
summarise_all(list(~sum(. == "unknown")))
## age job marital education default balance housing loan contact day month
## 1 0 288 0 1857 0 0 0 0 13020 0 0
## duration campaign pdays previous poutcome y
## 1 0 0 0 0 36959 0
It turns out that missing values are encoded as “unknown” in four attributes: job (288), education (1857), contact (13020), and poutcome (36959).
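If a later step needs these values treated as real missing values, “unknown” can be recoded to `NA`. The sketch below uses a hypothetical data frame `toy` with illustrative values; this analysis itself keeps “unknown” as a regular category.

```r
# Hypothetical stand-in for `bank`; values are illustrative only.
toy <- data.frame(
  job       = c("management", "unknown", "technician"),
  education = c("tertiary", "secondary", "unknown"),
  age       = c(58L, 44L, 33L),
  stringsAsFactors = FALSE
)

# Count "unknown" per column (the comparison is FALSE for numeric columns).
unknown_counts <- colSums(toy == "unknown")

# Recode "unknown" to NA on a copy, so the original stays untouched.
toy_na <- toy
toy_na[toy_na == "unknown"] <- NA

# After recoding, is.na() reports the same cells.
na_counts <- colSums(is.na(toy_na))
print(na_counts)
```

For factor columns the same assignment works, but the unused “unknown” level stays in `levels()` until `droplevels()` is called.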
# Frequency target variable
table(bank$y)
##
## no yes
## 39922 5289
# Frequency poutcome vs target variable
table(bank$y, bank$poutcome)
##
## failure other success unknown
## no 4283 1533 533 33573
## yes 618 307 978 3386
However, we cannot simply delete the rows that contain “unknown”: the 36959 rows with an unknown poutcome alone include 3386 of the 5289 “yes” cases in the target variable.
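To see how much of each target class would be lost, the cross-table can be turned into row percentages. This sketch re-enters the counts printed above by hand into a matrix named `tab` (an illustrative name) so that it runs on its own.

```r
# Counts copied from table(bank$y, bank$poutcome) above.
tab <- matrix(
  c(4283, 1533, 533, 33573,
     618,  307, 978,  3386),
  nrow = 2, byrow = TRUE,
  dimnames = list(y = c("no", "yes"),
                  poutcome = c("failure", "other", "success", "unknown"))
)

# Row percentages: how each target class is spread across poutcome values.
share <- round(100 * prop.table(tab, margin = 1), 1)
print(share)
# In the "yes" row, 64.0% of subscribers have poutcome "unknown"
```

Dropping the “unknown” rows would therefore remove roughly two thirds of the positive cases.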
3.3. Duplicate Check
# Count fully duplicated rows; the comma makes the logical vector select rows, not columns
nrow(bank[duplicated(bank), ])
## [1] 0
There are no duplicate rows in the dataset.
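When indexing a data frame with the result of `duplicated()`, the logical vector has one entry per row, so it belongs before the comma. A small sketch on a hypothetical data frame `toy` (illustrative values) shows how to count, inspect, and drop duplicates:

```r
# Hypothetical data frame in which the third row repeats the second.
toy <- data.frame(
  id  = c(1, 2, 2, 3),
  job = c("admin.", "services", "services", "student")
)

dup <- duplicated(toy)   # TRUE for rows that repeat an earlier row
n_dup <- sum(dup)        # number of duplicated rows
dup_rows <- toy[dup, ]   # the duplicated rows (index placed before the comma)
deduped <- toy[!dup, ]   # the data without duplicates

print(n_dup)
# [1] 1
```

`sum(duplicated(x))` is usually the quickest check, because it returns a single number regardless of how wide the data frame is.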
4. Data Understanding
4.1. Data dimension check
dim(bank)
## [1] 45211 17
The data we use has dimensions of 45211 x 17: 45211 observations (rows) and 17 variables (columns).
4.2. Summary check
summary(bank)
## age job marital education
## Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management :9458 married :27214 secondary:23202
## Median :39.00 technician :7597 single :12790 tertiary :13301
## Mean :40.94 admin. :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## pdays previous poutcome y
## Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
## Median : -1.0 Median : 0.0000 success: 1511
## Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
##
Summary :
- The age variable looks roughly symmetric around its center: the mean (40.94) and the median (39) are close. This alone does not show that age is normally distributed
- The youngest client is 18 years old and the oldest is 95 years old
- In contrast to age, the mean of the balance variable (1362) is much larger than its median (448), which indicates right skew
- The target is imbalanced: only 5289 of 45211 clients (about 11.7%) subscribed
- The previous variable is extremely right-skewed: its third quartile is 0 but its maximum is 275, which suggests outliers
- Telephone is the least common contact type (2906), behind cellular (29285) and unknown (13020)
- Far more clients have no credit in default (44396) than have credit in default (815)
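Two of these observations can be checked with a little arithmetic on the numbers that `summary(bank)` printed. The sketch below re-enters those numbers by hand; the object names are illustrative.

```r
# Target counts copied from the summary output above.
y_counts <- c(no = 39922, yes = 5289)

# Class shares in percent.
y_share <- round(100 * prop.table(y_counts), 1)
print(y_share)
# no 88.3, yes 11.7

# Mean and median copied from the summary output above.
centre <- data.frame(
  variable = c("age", "balance"),
  mean     = c(40.94, 1362),
  median   = c(39, 448)
)

# A ratio near 1 is consistent with a symmetric distribution;
# a ratio well above 1 points to right skew.
centre$mean_to_median <- round(centre$mean / centre$median, 2)
print(centre)
```

The ratio is only a rough screen. A histogram or a quantile plot of the raw column is still needed before assuming any particular distribution.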
5. Exploratory Data Analysis
Let’s answer some questions about the data set!
- How old are the people who subscribe to a term deposit? Show the age distribution!
Based on the age distribution plot, most of the bank’s clients are in the 25-40 age range, and the bank does not seem very interested in contacting the older population. However, above the 60-year threshold the relative frequency is higher when y = “yes”. In other words, elderly clients appear more likely to subscribe to a term deposit.
- What jobs do the bank’s clients have?
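One way to check the 60-year observation numerically is to compute the subscription rate per age band. This is a minimal sketch on a hypothetical ten-row data frame `toy`: the ages, the labels, and the band edges are illustrative assumptions, not values from the dataset.

```r
# Hypothetical stand-in for the age and y columns of `bank`.
toy <- data.frame(
  age = c(23, 31, 35, 38, 45, 52, 61, 67, 72, 80),
  y   = c("no", "no", "yes", "no", "no", "no", "yes", "no", "yes", "yes")
)

# Bin ages into bands; intervals are right-closed, e.g. (24, 40] is "25-40".
toy$age_band <- cut(
  toy$age,
  breaks = c(17, 24, 40, 60, Inf),
  labels = c("18-24", "25-40", "41-60", "61+")
)

# Share of "yes" within each band (row proportions of the cross-table).
counts <- table(toy$age_band, toy$y)
rates <- prop.table(counts, margin = 1)[, "yes"]
print(round(rates, 2))
```

Applied to `bank`, the same steps would show whether the 61+ band really has a higher rate than the 25-40 band, and how many clients each band contains.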
table(bank$job, bank$y)
##
## no yes
## admin. 4540 631
## blue-collar 9024 708
## entrepreneur 1364 123
## housemaid 1131 109
## management 8157 1301
## retired 1748 516
## self-employed 1392 187
## services 3785 369
## student 669 269
## technician 6757 840
## unemployed 1101 202
## unknown 254 34
By count, clients with management jobs subscribe most often (1301), and among the known job types housemaids subscribe least often (109). These are raw counts, so they also reflect how many clients of each job type were contacted.
It seems natural for management clients to buy the most, perhaps because they already have financial awareness and care about future investments, while housemaids may be less familiar with financial products and investment techniques.
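Because the job groups differ a lot in size, it helps to look at the subscription rate per job next to the raw counts. The sketch below re-enters the counts from the table above by hand; `job_counts` and `rate` are illustrative names.

```r
# Counts copied from table(bank$job, bank$y) above: columns are no, yes.
job_counts <- matrix(
  c(4540,  631,
    9024,  708,
    1364,  123,
    1131,  109,
    8157, 1301,
    1748,  516,
    1392,  187,
    3785,  369,
     669,  269,
    6757,  840,
    1101,  202,
     254,   34),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    job = c("admin.", "blue-collar", "entrepreneur", "housemaid",
            "management", "retired", "self-employed", "services",
            "student", "technician", "unemployed", "unknown"),
    y = c("no", "yes")
  )
)

# Share of "yes" within each job, in percent, highest first.
rate <- round(100 * prop.table(job_counts, margin = 1)[, "yes"], 1)
print(sort(rate, decreasing = TRUE))
```

By rate, students (28.7%) and retired clients (22.8%) come first, management is at 13.8%, and blue-collar clients are lowest at 7.3%, so the ranking by rate differs from the ranking by count.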
- Does marital status affect subscription?
plot(table(bank$marital, bank$y))
table(bank$marital, bank$y)
##
## no yes
## divorced 4585 622
## married 24459 2755
## single 10878 1912
Based on the plot and table above, marital status does not appear to strongly affect whether a client subscribes: the subscription rates are about 11.9% for divorced, 10.1% for married, and 14.9% for single clients.
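Whether the differences between marital groups are larger than chance alone would produce can be checked with a chi-squared test of independence. This is a sketch of one possible check, not part of the original analysis; it re-enters the counts from the table above by hand.

```r
# Counts copied from table(bank$marital, bank$y) above: columns are no, yes.
marital_counts <- matrix(
  c( 4585,  622,
    24459, 2755,
    10878, 1912),
  ncol = 2, byrow = TRUE,
  dimnames = list(marital = c("divorced", "married", "single"),
                  y = c("no", "yes"))
)

# Subscription rate per marital status, in percent.
rate <- round(100 * prop.table(marital_counts, margin = 1)[, "yes"], 1)
print(rate)

# Chi-squared test of independence between marital status and y.
test <- chisq.test(marital_counts)
print(test)
```

With more than 45000 observations, even a gap of a few percentage points yields a very small p-value, so the test result should be read together with the size of the difference in rates.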
- What about the outcome of the previous marketing campaign?
table(bank$poutcome, bank$y)
##
## no yes
## failure 4283 618
## other 1533 307
## success 533 978
## unknown 33573 3386
Clients who subscribed after a previous campaign are very likely to subscribe again: 978 of 1511 (about 64.7%) did so. Even clients for whom the previous campaign failed subscribe at a higher rate (618 of 4901, about 12.6%) than clients with an unknown previous outcome (3386 of 36959, about 9.2%). So even if the previous campaign was a failure, recontacting clients seems worthwhile.
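The comparison between previously failed contacts and clients with an unknown previous outcome can be tested with a two-sample test of proportions. The sketch below assumes that “unknown” mostly stands for clients who were not part of a previous campaign; the source does not confirm this. The counts are re-entered by hand from the table above.

```r
# "yes" counts and group sizes for poutcome = failure and poutcome = unknown.
subscribed <- c(failure = 618, unknown = 3386)
contacted  <- c(failure = 4283 + 618, unknown = 33573 + 3386)

# Two-sample test of equal proportions (continuity-corrected by default).
res <- prop.test(subscribed, contacted)

# Estimated subscription rates in percent: failure first, then unknown.
print(round(100 * res$estimate, 1))
print(res$p.value)
```

A small p-value here supports the observation that recontacted clients subscribe more often, but it does not show that recontacting causes the difference, since the bank may have chosen whom to recontact.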
6. Conclusion
So, how can the bank increase the number of clients who want to buy the products it offers?
Based on the analysis we have done, there are several ways to do so.
The bank should promote its products more to clients aged 60 and over, because the exploration shows that clients in this age group are relatively more likely to subscribe to a term deposit.
Before promoting its products, the bank should also run outreach on the importance of investing, to raise financial awareness across all segments of society so that people understand and are familiar with the products it offers.
It is also important to keep promoting to every client, including clients who have subscribed before. Repeat promotion is likely to increase the number of clients who subscribe to the products the bank offers.