1 Background

1.1 Brief

Bank is a financial institution licensed to receive deposits and make loans. It may also provides financial services such as currency exchange and term deposit subscriptions. As a financial institution, it is important for bank to have some insight on what factors affect the customers’ choices to use their financial services. In this document, I am going to analyze a bank marketing data to gain some information on what influences people to make loans and term deposit subscription. The data is related with direct marketing campaigns based on phone calls of a Portuguese banking institution. I know that banking institutions also need something that allows them to classify customers, i.e. knowing which customer will be interested on their product, and the dataset was created for a classifying task. However, since it is out of scope, I will not perform any modeling here. If you really want to try to construct a classifier for this dataset, I would recommend that you split the data into train and test set, then create a model using Logistic Regression.

1.2 Libraries and Setup

These following packages are required in this notebook. Use install.packages() to install any packages that are not already downloaded and load them using library() function. I provided a brief explanation about their function.

reshape2: transform data between wide and long format
tigerstats: elementary statistics

library(reshape2)
library(tigerstats)

2 Data Preparation

2.1 Data Inspection

bank <- read.csv("data_input/bank-full.csv", sep = ";", stringsAsFactors = T)
head(bank)

##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no

Information about the features:

age: client’s age
job : type of job
marital : marital status
education: client’s last education
default: does the client have credit in default?
balance: average yearly balance, in euros
housing: has housing loan?
loan: does he client have personal loan?
contact: contact communication type
day: last contact day of the month
month: last contact month of year
duration: last contact duration, in seconds
campaign: number of contacts performed during this campaign and for this client
pdays: number of days that passed by after the client was last contacted from a previous campaign (-1 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client
poutcome: outcome of the previous marketing campaign
y: has the client subscribed a term deposit?

dim(bank)

## [1] 45211    17

We can see that there are 45.211 instances and 17 features in this dataset. Now, let’s check the data types.

str(bank)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

All columns already have the correct data types. However, I am concerned about the day and month features. There is no information about whether the data was collected in the same year or not. If the data was collected not in the same year, those features will be meaningless, because the exact dates are not provided. For now, I will drop the day and month features.

bank <- subset(bank, select = -c(day, month))
head(bank)

##   age          job marital education default balance housing loan contact
## 1  58   management married  tertiary      no    2143     yes   no unknown
## 2  44   technician  single secondary      no      29     yes   no unknown
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown
## 5  33      unknown  single   unknown      no       1      no   no unknown
## 6  35   management married  tertiary      no     231     yes   no unknown
##   duration campaign pdays previous poutcome  y
## 1      261        1    -1        0  unknown no
## 2      151        1    -1        0  unknown no
## 3       76        1    -1        0  unknown no
## 4       92        1    -1        0  unknown no
## 5      198        1    -1        0  unknown no
## 6      139        1    -1        0  unknown no

2.2 Data Cleaning

Checking the missing values in each column.

colSums(is.na(bank))

##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact  duration  campaign     pdays  previous  poutcome         y 
##         0         0         0         0         0         0         0

The data seems to not have any missing values. This allows us to proceed to the analysis part without needing any data imputation. However, I am a bit concerned about poutcome feature. Actually, There are 2 features: contact and poutcome, which have “unknown” values. I will first change all unknown values into NULL, and recheck the missing values.

bank[bank == "unknown"] <- NA
colSums(is.na(bank))

##       age       job   marital education   default   balance   housing      loan 
##         0       288         0      1857         0         0         0         0 
##   contact  duration  campaign     pdays  previous  poutcome         y 
##     13020         0         0         0         0     36959         0

Now it is clear that the data has some missing values. First, let’s check on contact feature.

table(bank$contact)

## 
##  cellular telephone   unknown 
##     29285      2906         0

For contact feature, I do not think that the “unknown” values are important, since it does not really matter though which media we contact people when doing a marketing campaign, as long as we are able to reach and communicate with our potential customers.

table(bank$poutcome)

## 
## failure   other success unknown 
##    4901    1840    1511       0

The poutcome feature, however, seems pretty important. It tells us whether the previous marketing campaign succeeded or not. There are too many “unknown” values in this feature. If I were to impute the values, I am afraid the result will not be good either. So, I will drop both contact and poutcome feature, and drop all the rows containing missing values in job and education, since it only contains a little missing values.

bank <- subset(bank, select = -c(contact, poutcome))
bank <- na.omit(bank)
bank <- droplevels.data.frame(bank)

3 Exploratory Data Analysis

Here, I have constructed several questions that will help analyze the data. Generally, the questions can be classified to get several kinds of information:

characteristics of people who tend to have loans,
relation between loans and term deposit subscriptions,
the effects of marketing campaign on term deposit subscription.

3.1 Characteristics of People Who Tend to Have Loans

Firstly, let’s take a look at the distribution of age.

bank.loan <- bank[bank$loan == "yes" | bank$housing == "yes", ]
bank.no.loan <- bank[bank$loan == "no" & bank$housing == "no", ]

hist(bank.loan$age, 
     col = "indianred",
     xlab = "Age", 
     main = "Distribution of Ages")

hist(bank.no.loan$age, 
     col = "skyblue4", 
     add = TRUE)

legend("topright", 
       c("Loan", "No Loan"), 
       col = c("indianred", "skyblue4"), 
       lwd = 5)

From the plot, we cannot conclude that people between 30 and 35 years old tend to have some loans, although it has the highest frequency. This is because the height of the distribution (i.e. the high frequency value) comes from the data itself. It simply means that the majority of the data consists of people between 30-35 years old. Instead, we can see the tendency towards having a loan from the difference between those 2 histogram (the distribution of people with loans and no loans) height. So, from the plot, we can conclude that people at age 30-40 tend to have loans, regardless of whether it is a housing loan or a personal loan. We can also see that beyond 40 years old, the tendency of having loans gradually decreases. This corresponds to the research on why households borrow money, which stated “the category of lack of debt applied to older people, whose main source of income was retirement and disability pensions” (Strzelecka and Zawadzka, 2020).

Next, I am interested in knowing what types of job tend to have loans. Before that, let’s take a look at the available types of jobs in the dataset.

table(bank$job)

## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##          5000          9278          1411          1195          9216 
##       retired self-employed      services       student    technician 
##          2145          1540          4004           775          7355 
##    unemployed 
##          1274

Now, let’s see what types of job that tend to have loans. I will create a new column called any.loan that stores binary values:

“yes” if a person has either a housing loan or personal loan,
“no” if a person does not have any loan.

# Combine housing and loan
bank$any.loan <- paste(bank$housing, bank$loan)

# Return "no" if a person does not have both housing loan and personal loan
yn <- function(x){
  if(x == "no no"){
    x <- "no"
  } else {
    x <- "yes"
  }
}

# Apply the function
bank$any.loan <- sapply(bank$any.loan, yn)

Creating a dataframe consists of the percentage of people who have loans for each type of job.

job.table <- as.data.frame.matrix(xtabs(~ job + any.loan,
                                        data = bank))

job.table <- round(job.table / rowSums(job.table), 2)

job.table <- subset(job.table[order(job.table$yes, decreasing = T), ], 
                    select = "yes")

colnames(job.table) <- "Percentage of People with Loans"

job.table

##               Percentage of People with Loans
## blue-collar                              0.79
## services                                 0.74
## admin.                                   0.70
## entrepreneur                             0.69
## technician                               0.61
## management                               0.55
## self-employed                            0.55
## unemployed                               0.45
## housemaid                                0.40
## retired                                  0.32
## student                                  0.31

As we can see from the table above, 78% of blue-collar workers have loans. Perhaps it is because of the low revenue. Entrepreneurs having loan are pretty logical, since they need a lot of money to start their own business. There is an interesting thing here. Why are there only 45% unemployed people who have loans? I personally thought that unemployed people tend to borrow money for their daily necessities. Now, it seems like people tend to borrow money for working, but it does not seem like it because of jobs that does not need to use our own capital, such as admin and management. Let’s add a new feature to the table, balance, to gain some more insight.

job.agg <- aggregate(balance ~ job + any.loan,
                     data = bank,
                     FUN = mean)

dcast(job.agg, job ~ any.loan)

##              job       no       yes
## 1         admin. 1522.025  945.1443
## 2    blue-collar 1530.309  957.5999
## 3   entrepreneur 1999.530 1300.1508
## 4      housemaid 1533.990 1038.5166
## 5     management 2052.073 1538.0472
## 6        retired 2410.072 1145.3678
## 7  self-employed 2122.139 1264.6686
## 8       services 1084.511  932.2357
## 9        student 1453.854 1184.5455
## 10    technician 1350.302 1161.7962
## 11    unemployed 1830.785 1148.4021

It is now clear what really affects people in borrowing money. It is not a matter of what kind of job do people have, but rather how much balance do people have in average every year. I am interested in finding the threshold, i.e. in which range of yearly average balance people tend to borrow money.

plot(density(bank.loan$balance), 
     col = "goldenrod",
     xlab = "Average Yearly Balance (Euros)",
     lwd = 2,
     main = "Distribution of Balance")

lines(density(bank.no.loan$balance),
      lwd = 2,
      col = "dodgerblue")

legend("topright", 
       c("Loan", "No Loan"), 
       col = c("goldenrod", "dodgerblue"), 
       lwd = 5)

The plot turns out to be very skewed. Seems like there are many outliers in the dataset. Let’s focus on the range where most of the data lies.

plot(density(bank.loan$balance, from = -2000, to = 3000), 
     col = "goldenrod",
     xlab = "Average Yearly Balance (Euros)",
     lwd = 2,
     main = "Distribution of Balance")

lines(density(bank.no.loan$balance),
      lwd = 2,
      col = "dodgerblue")

legend("topright", 
       c("Loan", "No Loan"), 
       col = c("goldenrod", "dodgerblue"), 
       lwd = 5)

Here, we can see that people who tend to have loans are those whose average yearly balance is between 0-500 Euros. This makes sense since low average yearly balances forces people to borrow money in order to fulfill their daily necessities. However, an important thing not to miss here is the fact that some people with negative average yearly balance does not have any loan. Perhaps it is because some of the samples in the data consists of students. Let’s try to exclude the students and see the result.

plot(density(bank.loan[bank.loan$job != "student", ]$balance, from = -2000, to = 3000), 
     col = "goldenrod",
     xlab = "Average Yearly Balance (Euros)",
     lwd = 2,
     main = "Distribution of Balance")

lines(density(bank.no.loan[bank.no.loan$job != "student", ]$balance),
      lwd = 2,
      col = "dodgerblue")

legend("topright", 
       c("Loan", "No Loan"), 
       col = c("goldenrod", "dodgerblue"), 
       lwd = 5)

Removing the students does not change the fact that some people with negative average yearly balance does not have a loan. Further research about this case may be conducted to see the exact reason of why this event could happen. Now, we will see whether education level affects people’s tendency of having loans.

bank.edu <- bank[bank$education != "unknown", ]
bank.edu$education <- droplevels(bank.edu$education)

barplot(xtabs( ~ any.loan + education,
               data = bank.edu),
        col = c("indianred", "skyblue4"),
        main = "Loan Tendency in Each Educational Level",
        xlab = "Educational Level",
        ylab = "Frequency",
        beside = TRUE)

legend("topright", 
       c("Loan", "No Loan"), 
       col = c("skyblue4", "indianred"), 
       lwd = 5)

Here in any educational level, the number of people who have loans are higher than those who don’t. The tendency on having loans might seem very obvious for people with secondary education. However, I personally think that educational level will be related to the job and the average yearly balance. I am pretty curious in finding what do people spend the borrowed money on for each type of education level. But, the data does not provide any information about it, so for now I will just leave it be.

3.1.1 Key Informations

The main factors that affect people on having loans are age and average yearly balance.
The analysis result corresponds with the research on Socio-Economic Factors Affecting Households Debts conducted by Agnieszka Strzelecka and Danuta Zawadzka.
People at age 30-40 tend to have loans, regardless of whether it is a housing loan or a personal loan. Beyond that range, the tendency of having loans gradually decreases.
People with low average yearly balance (0-500 Euros) tend to have loans. According to Strzelecka and Zawadzka (2020), the reason that this category tend to have loan is to fulfill household necessities.

3.2 Relations between Loans and Term Deposits Subscription

The next thing I am interested in finding is the relation between loans and term deposits subscription. Firstly, let’s see whether having loans affect people’s decision on subscribing term deposits.

rowPerc(xtabs(~ any.loan + y,
              data = bank))

##         y
## any.loan     no    yes  Total
##      no   81.78  18.22 100.00
##      yes  92.29   7.71 100.00

It seems like the majority of people does not have term deposit subscription, no matter whether they have loans or not. However, we can not directly conclude that there is no relationship between these 2 features, since the percentage of people with no loans subscribing to term deposits is twice more than those with loans. We can use the Chi-Square Independence Test to see whether there is any relation between 2 categorical features, with the following hypothesis:

\(H_0\): the variables are independent. There is no relationship between these categorical variables.
\(H_1\): the features are dependent. There is a relationship between these categorical variables.

chisq.test(table(bank$any.loan, bank$y), correct = F)

## 
##  Pearson's Chi-squared test
## 
## data:  table(bank$any.loan, bank$y)
## X-squared = 1085.9, df = 1, p-value < 2.2e-16

We get a p-value of \(2.2 \cdot 10^{-16}\). So, we can conclude that statistically, having loan is correlated with subscribing term deposits; and knowing whether someone has a loan or not can help us to predict whether he has subscribed to a term deposit or not.

3.3 How Marketing Campaigns Influence People on Subscribing Term Deposits

I personally think that to know the effects of marketing campaigns on people’s decision to subscribe term deposit, we can use 2 features in the data: the number of contacts performed for each client during the campaign, and the duration of the last contact performed. For the latter one, I expect that longer contact duration means that the client is more interested in subscribing term deposits. Firstly, let’s see whether the number of contacts performed to a particular client during marketing campaign have any effect on term deposit subscriptions.

plot(density(bank[bank$y == "yes", ]$campaign, from = 0, to = 10), 
     col = "goldenrod",
     xlab = "Number of Contacts Performed",
     lwd = 2,
     main = "Distribution of Number of Contacts Performed During Campaign")

lines(density(bank[bank$y == "no", ]$campaign, from = 0, to = 10),
      lwd = 2,
      col = "dodgerblue")

legend("topright", 
       c("Subscribed", "Unsubscribed"), 
       col = c("goldenrod", "dodgerblue"), 
       lwd = 5)

The pattern seems to be random, so it seems we are unable to extract any useful information. Let’s now check on the contact duration, I will construct a box plot to see whether there is any difference in the contact duration distribution for those who have subscribed term deposits and those who have not. The data have many outliers, so I will exclude them to make the box plot clearer to see.

boxplot(duration ~ y,
        data = bank,
        outline = F,
        main = "Contact Duration Distribution On Term Deposit Subscription",
        xlab = "Term Deposit Subscription",
        ylab = "Duration",
        col = c("indianred", "skyblue4"))

As expected, although the number of contacts performed seems to have a random pattern, the duration of last contact performed seems to be able to explain whether a client is interested in subscribing term deposits or not. Clients who are interested in term deposit subscription tend to have longer contact duration, I assume that it is because they asked a lot about the product details, e.g. the return or fee.

3.4 Conclusion

The analysis result on what influences people to have loans corresponds with the research on Socio-Economic Factors Affecting Households Debts conducted by Agnieszka Strzelecka and Danuta Zawadzka. Based on the analysis, the main factors that affect people on having loans are age and average yearly balance. People at age 30-40 tend to have loans, regardless of whether it is a housing loan or a personal loan. Beyond that range, the tendency of having loans gradually decreases. People with low average yearly balance (0-500 Euros) tend to have loans. It is stated in the research conducted by Strzelecka and Zawadzka (2020), that the reason that this category tend to have loan is to fulfill household necessities.

About people’s decision on subscribing term deposits, we can say that it is influenced by the quality of marketing campaign and whether he has a loan or not. It is proven that statistically, having loan is correlated with subscribing term deposits; and knowing whether someone has a loan or not can help us in predicting whether he has subscribed to a term deposit or not. As for the marketing campaign, the number of contacts performed to a particular client during a marketing campaign was unable to explain whether a client was interested in term deposit subscription or not. On the contrary, the contact duration to a particular client, was able to explain clients’ interest on subscribing term deposit.

3.5 References

Strzelecka, Agnieszka & Zawadzka, Danuta. (2020). Why Households Borrow Money? Socio-Economic Factors Affecting Households Debts: A Model Approach. EUROPEAN RESEARCH STUDIES JOURNAL. XXIII. 820-839. 10.35808/ersj/1900.

Exploratory Data Analysis On Bank Marketing Data

Stane Aurelius Ronotana

13 August 2021