In this Exploratory Data Analysis, I’ll use the data from a Portuguese banking institution, about marketing campaigns based on phone calls (Telemarketing).
This data contains 4521 observations with 17 variables. Main variables are age, type of job, marital status, and education. Most of the variables are categorical, for instance, education, whether the client had finished primary, secondary, or tertiary education. An in-depth explanation about the variables will be given at the next section.
This csv file is separated by semi-colon, so we need to include an argument “sep =” since the default separator is a comma.
The data is stored to a variable named “client”.
First we check the structure of the data.
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
The highest level of the variables is 12, so there is no need to convert the data type. Next, we check the missing value from the data.
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
As we can see, there is no missing value in the data. That means the data is already clean and ready to use.
Before we further analyze the data, we need to inspect the dimension, variables, etc.
In order to see a glance of how the data looks like, we can display the first 6 and last 6 observations from the data.
The purpose is to see a little bit of how the data actually looks like.
Next up, we can go through the structure and variables of the data.
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
The data has 4521 rows and 17 columns. Here is the information of each column:
From the explanation above, we can decide that some of the variables will not be used in further analysis.
Unused variables need to be removed to simplify the data. Those variables are “default”, “day”, “month”, “duration”, “campaign”, “pdays”, “previous”, “poutcome”, and “y”. The main reason for removing these variables as it is hugely useful for classification and machine learning, but barely useful for exploratory data analysis.
# in order to subset data, we need to use the column index
# to get multiple column index, we use "match()" function
match(c("default","day","month","duration","campaign","pdays","previous","poutcome","y"),names(client))## [1] 5 10 11 12 13 14 15 16 17
Now we re-check the data by its preview.
First of all, we need to see a brief overview of the data.
## age job marital education
## Min. :19.00 management :969 divorced: 528 primary : 678
## 1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306
## Median :39.00 technician :768 single :1196 tertiary :1350
## Mean :41.17 admin. :478 unknown : 187
## 3rd Qu.:49.00 services :417
## Max. :87.00 retired :230
## (Other) :713
## balance housing loan contact
## Min. :-3313 no :1962 no :3830 cellular :2896
## 1st Qu.: 69 yes:2559 yes: 691 telephone: 301
## Median : 444 unknown :1324
## Mean : 1423
## 3rd Qu.: 1480
## Max. :71188
##
Summary:
For numeric attributes, analysis goes through boxplot, to see the data skewness to which side.
To see whether there are much of outliers or not, we use boxplot.
As we can see, the data is massively irregular. It has lots of outliers, especially upper outlier. Now let’s check the number of outliers.
# First, we calculate the minimum value from boxplot
min_plot <- as.numeric(quantile(client$balance)[2]-(1.5*IQR(client$balance)))
# Then, the maximum value
max_plot <- as.numeric(quantile(client$balance)[4]+(1.5*IQR(client$balance)))
# quantile() returns the value of min, Q1, median, Q3, and max.So the minimum and maximum value of the boxplot are:
## [1] "Minimum Value : -2047.5"
## [1] "Maximum Value : 3596.5"
Now we count how many outliers there actually are
lower_outlier <- client[client$balance<min_plot,]
upper_outlier <- client[client$balance>max_plot,]
above_q3 <- client[client$balance>quantile(client$balance)[4],]
below_q1 <- client[client$balance<quantile(client$balance)[2],]Number of Outliers :
## [1] "Lower Outlier : 2"
## [1] "Upper Outlier : 504"
Quartiles
## [1] "Below Q1 : 1129"
## [1] "Above Q3 : 1129"
Data below Q1 and above Q3 are equal, so to determine the data skewness, we check the outliers.
There are only 2 lower outlier, while there are 504 upper outliers. So the “yearly balance” data are mostly spreaded over the 3rd quartile (top 25%)
For categorical attributes, the variable analysis we can do is data sorting.
Job Type
##
## unknown student housemaid unemployed entrepreneur
## 38 84 112 128 168
## self-employed retired services admin. technician
## 183 230 417 478 768
## blue-collar management
## 946 969
Marital Status
##
## divorced single married
## 528 1196 2797
Education
##
## unknown primary tertiary secondary
## 187 678 1350 2306
Housing Loan
##
## no yes
## 1962 2559
Personal Loan
##
## yes no
## 691 3830
Contact Type
##
## telephone unknown cellular
## 301 1324 2896
age_edu <- aggregate(formula=balance~education,data=client,FUN = median)
age_edu[age_edu$balance==max(age_edu$balance),]So the client with tertiary education degree tends to have a higher yearly balance.
## [1] 2869
## [1] 1652
Yes, most of the clients are aged above 35, with 1217 differences.
No, Client who is single slightly has a higher balance than married client. However, divorced client has much lower balance than the other two.
job_balance <- aggregate(balance~job,client,median)
job_balance[order(job_balance$balance,decreasing = T),]Retired clients have the highest balance.
num_loan <- nrow(client[client$housing=="yes" | client$loan=="yes",])
no_loan <- nrow(client)-num_loan
round((num_loan/nrow(client))*100,digits = 2)## [1] 62.91
Yes, most of the clients at least having a loan.
Clients using telephone tend to have a higher balance.
Clients with no loans tend to have a higher balance.
## [1] 0.08382014
There is a very small positive correlation between age and balance, almost no correlation.
Total balance of clients in management are the highest.
# Creating mode function
mode_count <- function(x){
data_table <- -table(x)
data_table <- sort(data_table)
print(paste("Mode :",names(data_table[1])))
print(paste("Occurences :",-data_table[1]))
}
# Calling mode function
mode_count(client$age)## [1] "Mode : 34"
## [1] "Occurences : 231"
Most of the clients are aged 34, with 231 occurences.
##
## primary secondary tertiary unknown
## divorced 79 270 155 24
## married 526 1427 727 117
## single 73 609 468 46
Most of the clients have married and finished secondary education.
There are 371 clients whose balance is below 3000 and having both housing and personal loan.
129 of the clients are aged below 30 and married.
There are 41 clients who are aged above 40, yearly balance above 10000, and married.
There are 88 clients who work as technician, married, and finished tertiary education.
# Creating percentage function
percentage <- function(x){
x <- x*100
x <- round(x,digits = 2)
print(x)
}Job
##
## admin. blue-collar entrepreneur housemaid management
## 10.57 20.92 3.72 2.48 21.43
## retired self-employed services student technician
## 5.09 4.05 9.22 1.86 16.99
## unemployed unknown
## 2.83 0.84
Marital Status
##
## divorced married single
## 11.68 61.87 26.45
Education
##
## primary secondary tertiary unknown
## 15.00 51.01 29.86 4.14
Housing Loan
##
## no yes
## 43.4 56.6
Personal Loan
##
## no yes
## 84.72 15.28
Contact Type
##
## cellular telephone unknown
## 64.06 6.66 29.29