Bank is a financial institution licensed to receive deposits and make loans. It may also provides financial services such as currency exchange and term deposit subscriptions. As a financial institution, it is important for bank to have some insight on what factors affect the customers’ choices to use their financial services. In this document, I am going to analyze a bank marketing data to gain some information on what influences people to make loans and term deposit subscription. The data is related with direct marketing campaigns based on phone calls of a Portuguese banking institution. I know that banking institutions also need something that allows them to classify customers, i.e. knowing which customer will be interested on their product, and the dataset was created for a classifying task. However, since it is out of scope, I will not perform any modeling here. If you really want to try to construct a classifier for this dataset, I would recommend that you split the data into train and test set, then create a model using Logistic Regression.
These following packages are required in this notebook. Use install.packages() to install any packages that are not already downloaded and load them using library() function. I provided a brief explanation about their function.
library(reshape2)
library(tigerstats)bank <- read.csv("data_input/bank-full.csv", sep = ";", stringsAsFactors = T)
head(bank)## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
Information about the features:
dim(bank)## [1] 45211 17
We can see that there are 45.211 instances and 17 features in this dataset. Now, let’s check the data types.
str(bank)## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
All columns already have the correct data types. However, I am concerned about the day and month features. There is no information about whether the data was collected in the same year or not. If the data was collected not in the same year, those features will be meaningless, because the exact dates are not provided. For now, I will drop the day and month features.
bank <- subset(bank, select = -c(day, month))
head(bank)## age job marital education default balance housing loan contact
## 1 58 management married tertiary no 2143 yes no unknown
## 2 44 technician single secondary no 29 yes no unknown
## 3 33 entrepreneur married secondary no 2 yes yes unknown
## 4 47 blue-collar married unknown no 1506 yes no unknown
## 5 33 unknown single unknown no 1 no no unknown
## 6 35 management married tertiary no 231 yes no unknown
## duration campaign pdays previous poutcome y
## 1 261 1 -1 0 unknown no
## 2 151 1 -1 0 unknown no
## 3 76 1 -1 0 unknown no
## 4 92 1 -1 0 unknown no
## 5 198 1 -1 0 unknown no
## 6 139 1 -1 0 unknown no
Checking the missing values in each column.
colSums(is.na(bank))## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact duration campaign pdays previous poutcome y
## 0 0 0 0 0 0 0
The data seems to not have any missing values. This allows us to proceed to the analysis part without needing any data imputation. However, I am a bit concerned about poutcome feature. Actually, There are 2 features: contact and poutcome, which have “unknown” values. I will first change all unknown values into NULL, and recheck the missing values.
bank[bank == "unknown"] <- NA
colSums(is.na(bank))## age job marital education default balance housing loan
## 0 288 0 1857 0 0 0 0
## contact duration campaign pdays previous poutcome y
## 13020 0 0 0 0 36959 0
Now it is clear that the data has some missing values. First, let’s check on contact feature.
table(bank$contact)##
## cellular telephone unknown
## 29285 2906 0
For contact feature, I do not think that the “unknown” values are important, since it does not really matter though which media we contact people when doing a marketing campaign, as long as we are able to reach and communicate with our potential customers.
table(bank$poutcome)##
## failure other success unknown
## 4901 1840 1511 0
The poutcome feature, however, seems pretty important. It tells us whether the previous marketing campaign succeeded or not. There are too many “unknown” values in this feature. If I were to impute the values, I am afraid the result will not be good either. So, I will drop both contact and poutcome feature, and drop all the rows containing missing values in job and education, since it only contains a little missing values.
bank <- subset(bank, select = -c(contact, poutcome))
bank <- na.omit(bank)
bank <- droplevels.data.frame(bank)Here, I have constructed several questions that will help analyze the data. Generally, the questions can be classified to get several kinds of information:
Firstly, let’s take a look at the distribution of age.
bank.loan <- bank[bank$loan == "yes" | bank$housing == "yes", ]
bank.no.loan <- bank[bank$loan == "no" & bank$housing == "no", ]
hist(bank.loan$age,
col = "indianred",
xlab = "Age",
main = "Distribution of Ages")
hist(bank.no.loan$age,
col = "skyblue4",
add = TRUE)
legend("topright",
c("Loan", "No Loan"),
col = c("indianred", "skyblue4"),
lwd = 5)From the plot, we cannot conclude that people between 30 and 35 years old tend to have some loans, although it has the highest frequency. This is because the height of the distribution (i.e. the high frequency value) comes from the data itself. It simply means that the majority of the data consists of people between 30-35 years old. Instead, we can see the tendency towards having a loan from the difference between those 2 histogram (the distribution of people with loans and no loans) height. So, from the plot, we can conclude that people at age 30-40 tend to have loans, regardless of whether it is a housing loan or a personal loan. We can also see that beyond 40 years old, the tendency of having loans gradually decreases. This corresponds to the research on why households borrow money, which stated “the category of lack of debt applied to older people, whose main source of income was retirement and disability pensions” (Strzelecka and Zawadzka, 2020).
Next, I am interested in knowing what types of job tend to have loans. Before that, let’s take a look at the available types of jobs in the dataset.
table(bank$job)##
## admin. blue-collar entrepreneur housemaid management
## 5000 9278 1411 1195 9216
## retired self-employed services student technician
## 2145 1540 4004 775 7355
## unemployed
## 1274
Now, let’s see what types of job that tend to have loans. I will create a new column called any.loan that stores binary values:
# Combine housing and loan
bank$any.loan <- paste(bank$housing, bank$loan)
# Return "no" if a person does not have both housing loan and personal loan
yn <- function(x){
if(x == "no no"){
x <- "no"
} else {
x <- "yes"
}
}
# Apply the function
bank$any.loan <- sapply(bank$any.loan, yn)Creating a dataframe consists of the percentage of people who have loans for each type of job.
job.table <- as.data.frame.matrix(xtabs(~ job + any.loan,
data = bank))
job.table <- round(job.table / rowSums(job.table), 2)
job.table <- subset(job.table[order(job.table$yes, decreasing = T), ],
select = "yes")
colnames(job.table) <- "Percentage of People with Loans"
job.table## Percentage of People with Loans
## blue-collar 0.79
## services 0.74
## admin. 0.70
## entrepreneur 0.69
## technician 0.61
## management 0.55
## self-employed 0.55
## unemployed 0.45
## housemaid 0.40
## retired 0.32
## student 0.31
As we can see from the table above, 78% of blue-collar workers have loans. Perhaps it is because of the low revenue. Entrepreneurs having loan are pretty logical, since they need a lot of money to start their own business. There is an interesting thing here. Why are there only 45% unemployed people who have loans? I personally thought that unemployed people tend to borrow money for their daily necessities. Now, it seems like people tend to borrow money for working, but it does not seem like it because of jobs that does not need to use our own capital, such as admin and management. Let’s add a new feature to the table, balance, to gain some more insight.
job.agg <- aggregate(balance ~ job + any.loan,
data = bank,
FUN = mean)
dcast(job.agg, job ~ any.loan)## job no yes
## 1 admin. 1522.025 945.1443
## 2 blue-collar 1530.309 957.5999
## 3 entrepreneur 1999.530 1300.1508
## 4 housemaid 1533.990 1038.5166
## 5 management 2052.073 1538.0472
## 6 retired 2410.072 1145.3678
## 7 self-employed 2122.139 1264.6686
## 8 services 1084.511 932.2357
## 9 student 1453.854 1184.5455
## 10 technician 1350.302 1161.7962
## 11 unemployed 1830.785 1148.4021
It is now clear what really affects people in borrowing money. It is not a matter of what kind of job do people have, but rather how much balance do people have in average every year. I am interested in finding the threshold, i.e. in which range of yearly average balance people tend to borrow money.
plot(density(bank.loan$balance),
col = "goldenrod",
xlab = "Average Yearly Balance (Euros)",
lwd = 2,
main = "Distribution of Balance")
lines(density(bank.no.loan$balance),
lwd = 2,
col = "dodgerblue")
legend("topright",
c("Loan", "No Loan"),
col = c("goldenrod", "dodgerblue"),
lwd = 5)The plot turns out to be very skewed. Seems like there are many outliers in the dataset. Let’s focus on the range where most of the data lies.
plot(density(bank.loan$balance, from = -2000, to = 3000),
col = "goldenrod",
xlab = "Average Yearly Balance (Euros)",
lwd = 2,
main = "Distribution of Balance")
lines(density(bank.no.loan$balance),
lwd = 2,
col = "dodgerblue")
legend("topright",
c("Loan", "No Loan"),
col = c("goldenrod", "dodgerblue"),
lwd = 5)Here, we can see that people who tend to have loans are those whose average yearly balance is between 0-500 Euros. This makes sense since low average yearly balances forces people to borrow money in order to fulfill their daily necessities. However, an important thing not to miss here is the fact that some people with negative average yearly balance does not have any loan. Perhaps it is because some of the samples in the data consists of students. Let’s try to exclude the students and see the result.
plot(density(bank.loan[bank.loan$job != "student", ]$balance, from = -2000, to = 3000),
col = "goldenrod",
xlab = "Average Yearly Balance (Euros)",
lwd = 2,
main = "Distribution of Balance")
lines(density(bank.no.loan[bank.no.loan$job != "student", ]$balance),
lwd = 2,
col = "dodgerblue")
legend("topright",
c("Loan", "No Loan"),
col = c("goldenrod", "dodgerblue"),
lwd = 5)Removing the students does not change the fact that some people with negative average yearly balance does not have a loan. Further research about this case may be conducted to see the exact reason of why this event could happen. Now, we will see whether education level affects people’s tendency of having loans.
bank.edu <- bank[bank$education != "unknown", ]
bank.edu$education <- droplevels(bank.edu$education)
barplot(xtabs( ~ any.loan + education,
data = bank.edu),
col = c("indianred", "skyblue4"),
main = "Loan Tendency in Each Educational Level",
xlab = "Educational Level",
ylab = "Frequency",
beside = TRUE)
legend("topright",
c("Loan", "No Loan"),
col = c("skyblue4", "indianred"),
lwd = 5)Here in any educational level, the number of people who have loans are higher than those who don’t. The tendency on having loans might seem very obvious for people with secondary education. However, I personally think that educational level will be related to the job and the average yearly balance. I am pretty curious in finding what do people spend the borrowed money on for each type of education level. But, the data does not provide any information about it, so for now I will just leave it be.
The main factors that affect people on having loans are age and average yearly balance.
The analysis result corresponds with the research on Socio-Economic Factors Affecting Households Debts conducted by Agnieszka Strzelecka and Danuta Zawadzka.
People at age 30-40 tend to have loans, regardless of whether it is a housing loan or a personal loan. Beyond that range, the tendency of having loans gradually decreases.
People with low average yearly balance (0-500 Euros) tend to have loans. According to Strzelecka and Zawadzka (2020), the reason that this category tend to have loan is to fulfill household necessities.
The next thing I am interested in finding is the relation between loans and term deposits subscription. Firstly, let’s see whether having loans affect people’s decision on subscribing term deposits.
rowPerc(xtabs(~ any.loan + y,
data = bank))## y
## any.loan no yes Total
## no 81.78 18.22 100.00
## yes 92.29 7.71 100.00
It seems like the majority of people does not have term deposit subscription, no matter whether they have loans or not. However, we can not directly conclude that there is no relationship between these 2 features, since the percentage of people with no loans subscribing to term deposits is twice more than those with loans. We can use the Chi-Square Independence Test to see whether there is any relation between 2 categorical features, with the following hypothesis:
chisq.test(table(bank$any.loan, bank$y), correct = F)##
## Pearson's Chi-squared test
##
## data: table(bank$any.loan, bank$y)
## X-squared = 1085.9, df = 1, p-value < 2.2e-16
We get a p-value of \(2.2 \cdot 10^{-16}\). So, we can conclude that statistically, having loan is correlated with subscribing term deposits; and knowing whether someone has a loan or not can help us to predict whether he has subscribed to a term deposit or not.
I personally think that to know the effects of marketing campaigns on people’s decision to subscribe term deposit, we can use 2 features in the data: the number of contacts performed for each client during the campaign, and the duration of the last contact performed. For the latter one, I expect that longer contact duration means that the client is more interested in subscribing term deposits. Firstly, let’s see whether the number of contacts performed to a particular client during marketing campaign have any effect on term deposit subscriptions.
plot(density(bank[bank$y == "yes", ]$campaign, from = 0, to = 10),
col = "goldenrod",
xlab = "Number of Contacts Performed",
lwd = 2,
main = "Distribution of Number of Contacts Performed During Campaign")
lines(density(bank[bank$y == "no", ]$campaign, from = 0, to = 10),
lwd = 2,
col = "dodgerblue")
legend("topright",
c("Subscribed", "Unsubscribed"),
col = c("goldenrod", "dodgerblue"),
lwd = 5)The pattern seems to be random, so it seems we are unable to extract any useful information. Let’s now check on the contact duration, I will construct a box plot to see whether there is any difference in the contact duration distribution for those who have subscribed term deposits and those who have not. The data have many outliers, so I will exclude them to make the box plot clearer to see.
boxplot(duration ~ y,
data = bank,
outline = F,
main = "Contact Duration Distribution On Term Deposit Subscription",
xlab = "Term Deposit Subscription",
ylab = "Duration",
col = c("indianred", "skyblue4"))As expected, although the number of contacts performed seems to have a random pattern, the duration of last contact performed seems to be able to explain whether a client is interested in subscribing term deposits or not. Clients who are interested in term deposit subscription tend to have longer contact duration, I assume that it is because they asked a lot about the product details, e.g. the return or fee.
The analysis result on what influences people to have loans corresponds with the research on Socio-Economic Factors Affecting Households Debts conducted by Agnieszka Strzelecka and Danuta Zawadzka. Based on the analysis, the main factors that affect people on having loans are age and average yearly balance. People at age 30-40 tend to have loans, regardless of whether it is a housing loan or a personal loan. Beyond that range, the tendency of having loans gradually decreases. People with low average yearly balance (0-500 Euros) tend to have loans. It is stated in the research conducted by Strzelecka and Zawadzka (2020), that the reason that this category tend to have loan is to fulfill household necessities.
About people’s decision on subscribing term deposits, we can say that it is influenced by the quality of marketing campaign and whether he has a loan or not. It is proven that statistically, having loan is correlated with subscribing term deposits; and knowing whether someone has a loan or not can help us in predicting whether he has subscribed to a term deposit or not. As for the marketing campaign, the number of contacts performed to a particular client during a marketing campaign was unable to explain whether a client was interested in term deposit subscription or not. On the contrary, the contact duration to a particular client, was able to explain clients’ interest on subscribing term deposit.