1 Data Explanation

1.1 Source

In this Exploratory Data Analysis, I’ll use the data from a Portuguese banking institution, about marketing campaigns based on phone calls (Telemarketing).

Link : Portuguese Bank Telemarketing

1.2 Brief Attribute Explanation

This data contains 4521 observations with 17 variables. Main variables are age, type of job, marital status, and education. Most of the variables are categorical, for instance, education, whether the client had finished primary, secondary, or tertiary education. An in-depth explanation about the variables will be given at the next section.

2 Load and Cleanse Data

2.1 Load Data

This csv file is separated by semi-colon, so we need to include an argument “sep =” since the default separator is a comma.

client <- read.csv(file = "data_input/bank.csv",sep = ";")

The data is stored to a variable named “client”.

2.2 Data Cleansing

First we check the structure of the data.

str(client)

## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

The highest level of the variables is 12, so there is no need to convert the data type. Next, we check the missing value from the data.

colSums(is.na(client))

##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

As we can see, there is no missing value in the data. That means the data is already clean and ready to use.

3 Data Inspection

Before we further analyze the data, we need to inspect the dimension, variables, etc.

3.1 Data Preview

In order to see a glance of how the data looks like, we can display the first 6 and last 6 observations from the data.

head(client)

tail(client)

The purpose is to see a little bit of how the data actually looks like.

3.2 Structure and Variable

Next up, we can go through the structure and variables of the data.

str(client)

## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

The data has 4521 rows and 17 columns. Here is the information of each column:

age: Integer data type of client’s age
job: There are 12 different type of jobs, including “student”, “retired”, “unemployed”, and “self-employed”.
marital: Marital status, whether the client is divorced, married, or single.
education: The last education finished by the client, whether primary, secondary, tertiary, or unknown.
default: The credit status.
balance: Average yearly balance, in euros.
housing: If the client have housing loan or not
loan: If the client have personal loan or not
contact: contact communication type
day: last contact day of the month
month: last contact month of year
duration: last contact duration, in seconds
campaign: number of contacts performed during this campaign and for this client
pdays: number of days that passed by after the client was last contacted from a previous campaign (-1 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client
poutcome: outcome of the previous marketing campaign
y: has the client subscribed a term deposit?

From the explanation above, we can decide that some of the variables will not be used in further analysis.

3.3 Deleting Unused Variable

head(client)

Unused variables need to be removed to simplify the data. Those variables are “default”, “day”, “month”, “duration”, “campaign”, “pdays”, “previous”, “poutcome”, and “y”. The main reason for removing these variables as it is hugely useful for classification and machine learning, but barely useful for exploratory data analysis.

# in order to subset data, we need to use the column index
# to get multiple column index, we use "match()" function
match(c("default","day","month","duration","campaign","pdays","previous","poutcome","y"),names(client))

## [1]  5 10 11 12 13 14 15 16 17

client <- client[,-c(5,10,11,12,13,14,15,16,17)]

Now we re-check the data by its preview.

head(client)

4 Data Analysis

4.1 Summary

First of all, we need to see a brief overview of the data.

summary(client)

##       age                 job          marital         education   
##  Min.   :19.00   management :969   divorced: 528   primary  : 678  
##  1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306  
##  Median :39.00   technician :768   single  :1196   tertiary :1350  
##  Mean   :41.17   admin.     :478                   unknown  : 187  
##  3rd Qu.:49.00   services   :417                                   
##  Max.   :87.00   retired    :230                                   
##                  (Other)    :713                                   
##     balance      housing     loan           contact    
##  Min.   :-3313   no :1962   no :3830   cellular :2896  
##  1st Qu.:   69   yes:2559   yes: 691   telephone: 301  
##  Median :  444                         unknown  :1324  
##  Mean   : 1423                                         
##  3rd Qu.: 1480                                         
##  Max.   :71188                                         
##

Summary:

Average age of the client is 41.
The bank has a long span of client, from 19 years old to 87 years old (range = 68).
Most of the clients work in management, blue-collar, or technician.
Most of the clients have married.
Most of the clients had finished secondary education.
The yearly balance data has outliers, both upper and lower.
Most of the clients have either housing or personal loan.
Most of the clients use cellular contact type.

4.2 Variable Analysis

4.2.1 Numeric Attributes

For numeric attributes, analysis goes through boxplot, to see the data skewness to which side.

4.2.1.1 Balance

To see whether there are much of outliers or not, we use boxplot.

boxplot(client$balance)

As we can see, the data is massively irregular. It has lots of outliers, especially upper outlier. Now let’s check the number of outliers.

# First, we calculate the minimum value from boxplot
min_plot <- as.numeric(quantile(client$balance)[2]-(1.5*IQR(client$balance)))
# Then, the maximum value
max_plot <- as.numeric(quantile(client$balance)[4]+(1.5*IQR(client$balance)))
# quantile() returns the value of min, Q1, median, Q3, and max.

So the minimum and maximum value of the boxplot are:

## [1] "Minimum Value : -2047.5"

## [1] "Maximum Value : 3596.5"

Now we count how many outliers there actually are

lower_outlier <- client[client$balance<min_plot,]
upper_outlier <- client[client$balance>max_plot,]
above_q3 <- client[client$balance>quantile(client$balance)[4],]
below_q1 <- client[client$balance<quantile(client$balance)[2],]

Number of Outliers :

## [1] "Lower Outlier : 2"

## [1] "Upper Outlier : 504"

Quartiles

## [1] "Below Q1 : 1129"

## [1] "Above Q3 : 1129"

Data below Q1 and above Q3 are equal, so to determine the data skewness, we check the outliers.

There are only 2 lower outlier, while there are 504 upper outliers. So the “yearly balance” data are mostly spreaded over the 3rd quartile (top 25%)

4.2.1.2 Age

boxplot(client$age)

The age is spreaded mostly over the median.

4.2.2 Categorical Attributes

For categorical attributes, the variable analysis we can do is data sorting.

Job Type

## 
##       unknown       student     housemaid    unemployed  entrepreneur 
##            38            84           112           128           168 
## self-employed       retired      services        admin.    technician 
##           183           230           417           478           768 
##   blue-collar    management 
##           946           969

Marital Status

## 
## divorced   single  married 
##      528     1196     2797

Education

## 
##   unknown   primary  tertiary secondary 
##       187       678      1350      2306

Housing Loan

## 
##   no  yes 
## 1962 2559

Personal Loan

## 
##  yes   no 
##  691 3830

Contact Type

## 
## telephone   unknown  cellular 
##       301      1324      2896

5 Data Manipulation & Transformation

Which final education degree has the highest yearly balance (in average)?

age_edu <- aggregate(formula=balance~education,data=client,FUN = median)
age_edu[age_edu$balance==max(age_edu$balance),]

So the client with tertiary education degree tends to have a higher yearly balance.

Are most of the clients aged above 35?

nrow(client[client$age>35,])

## [1] 2869

nrow(client[client$age<=35,])

## [1] 1652

Yes, most of the clients are aged above 35, with 1217 differences.

Does married client tends to have a higher yearly balance?

# We use median because the data has a lot of outliers
aggregate(balance~marital,client,median)

No, Client who is single slightly has a higher balance than married client. However, divorced client has much lower balance than the other two.

Which job produces the highest balance average?

job_balance <- aggregate(balance~job,client,median)
job_balance[order(job_balance$balance,decreasing = T),]

Retired clients have the highest balance.

Do most of the clients at least having a loan?

num_loan <- nrow(client[client$housing=="yes" | client$loan=="yes",])
no_loan <- nrow(client)-num_loan
round((num_loan/nrow(client))*100,digits = 2)

## [1] 62.91

Yes, most of the clients at least having a loan.

Does contact type has any relation with balance?

aggregate(balance~contact,client,median)

Clients using telephone tend to have a higher balance.

Is there any relation between balance and loan taken?

aggregate(balance~housing,client,median)

aggregate(balance~loan,client,median)

Clients with no loans tend to have a higher balance.

Is there any correlation between age and balance?

cor(client$age,client$balance)

## [1] 0.08382014

There is a very small positive correlation between age and balance, almost no correlation.

Which job has the highest total balance

job_total <- aggregate(balance~job,client,sum)
job_total[order(job_total$balance,decreasing = T),]

Total balance of clients in management are the highest.

What certain age is most of the clients in?

# Creating mode function

mode_count <- function(x){
  data_table <- -table(x)
  data_table <- sort(data_table)
  print(paste("Mode :",names(data_table[1])))
  print(paste("Occurences :",-data_table[1]))
}

# Calling mode function
mode_count(client$age)

## [1] "Mode : 34"
## [1] "Occurences : 231"

Most of the clients are aged 34, with 231 occurences.

Which combination of marital status and education are most of the clients in?

table(client$marital,client$education)

##           
##            primary secondary tertiary unknown
##   divorced      79       270      155      24
##   married      526      1427      727     117
##   single        73       609      468      46

Most of the clients have married and finished secondary education.

How many of the clients with balance below 3000 and having both housing and personal loan?

client[client$balance<3000 & client$housing=="yes" & client$loan == "yes",]

There are 371 clients whose balance is below 3000 and having both housing and personal loan.

How many of the clients aged below 30 that has married?

client[client$age<30 & client$marital=="married",]

129 of the clients are aged below 30 and married.

How many of the clients aged above 40, having a yearly balance above 10000, and married?

client[client$age>40 & client$balance>10000 & client$marital=="married",]

There are 41 clients who are aged above 40, yearly balance above 10000, and married.

How many clients have finished tertiary education, married, and work as a technician?

client[client$education=="tertiary" & client$marital=="married" & client$job=="technician",]

There are 88 clients who work as technician, married, and finished tertiary education.

Display distribution percentages for each categorical variable.

# Creating percentage function
percentage <- function(x){
  x <- x*100
  x <- round(x,digits = 2)
  print(x)
}

Job

percentage(prop.table(table(client$job)))

## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##         10.57         20.92          3.72          2.48         21.43 
##       retired self-employed      services       student    technician 
##          5.09          4.05          9.22          1.86         16.99 
##    unemployed       unknown 
##          2.83          0.84