The dataset to be worked with can be found in Bank Marketing
The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.
Bank Marketing provides many version of data, but the data that will be used in this project will be the ‘bank.xlsx’ version.
library(readxl)
bank <- read_excel("D:/WRITE IT RIGHT 2/bank-marketing/bank.xlsx")
head(bank)
## # A tibble: 6 × 17
## age job marital education default balance housing loan contact day
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 30 unemployed married primary no 1787 no no cellul… 19
## 2 33 services married secondary no 4789 yes yes cellul… 11
## 3 35 management single tertiary no 1350 yes no cellul… 16
## 4 30 management married tertiary no 1476 yes yes unknown 3
## 5 59 blue-coll… married secondary no 0 yes no unknown 5
## 6 35 management single tertiary no 747 no no cellul… 23
## # ℹ 7 more variables: month <chr>, duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <chr>, y <chr>
tail(bank)
## # A tibble: 6 × 17
## age job marital education default balance housing loan contact day
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 32 services single secondary no 473 yes no cellul… 7
## 2 33 services married secondary no -333 yes no cellul… 30
## 3 57 self-empl… married tertiary yes -3313 yes yes unknown 9
## 4 57 technician married secondary no 295 no no cellul… 19
## 5 28 blue-coll… married secondary no 1137 no no cellul… 6
## 6 44 entrepren… single tertiary no 1136 yes yes cellul… 3
## # ℹ 7 more variables: month <chr>, duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <chr>, y <chr>
dim(bank)
## [1] 4521 17
names(bank)
## [1] "age" "job" "marital" "education" "default" "balance"
## [7] "housing" "loan" "contact" "day" "month" "duration"
## [13] "campaign" "pdays" "previous" "poutcome" "y"
From the inspection, we can conclude:
* bank data contain 4521 rows with 17 columns
* Each columns name: “age”, “job”, “marital”, “education”, “default”,
“balance”, “housing”, “loan”, “contact”, “day”, “month”, “duration”,
“campaign”, “pdays”, “previous”, “poutcome”, and “y”
Check data types of each columns
str(bank)
## tibble [4,521 × 17] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:4521] 30 33 35 30 59 35 36 39 41 43 ...
## $ job : chr [1:4521] "unemployed" "services" "management" "management" ...
## $ marital : chr [1:4521] "married" "married" "single" "married" ...
## $ education: chr [1:4521] "primary" "secondary" "tertiary" "tertiary" ...
## $ default : chr [1:4521] "no" "no" "no" "no" ...
## $ balance : num [1:4521] 1787 4789 1350 1476 0 ...
## $ housing : chr [1:4521] "no" "yes" "yes" "yes" ...
## $ loan : chr [1:4521] "no" "yes" "no" "yes" ...
## $ contact : chr [1:4521] "cellular" "cellular" "cellular" "unknown" ...
## $ day : num [1:4521] 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr [1:4521] "oct" "may" "apr" "jun" ...
## $ duration : num [1:4521] 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : num [1:4521] 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : num [1:4521] -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : num [1:4521] 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr [1:4521] "unknown" "failure" "failure" "unknown" ...
## $ y : chr [1:4521] "no" "no" "no" "no" ...
From this result, we find some of data type not in the corect type. we need to convert it into correct type (data coertion)
bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$balance <- as.integer(bank$balance)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$month <- as.factor(bank$month)
bank$campaign <- as.integer(bank$campaign)
bank$pdays <- as.integer(bank$pdays)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)
str(bank)
## tibble [4,521 × 17] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:4521] 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int [1:4521] 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day : num [1:4521] 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : num [1:4521] 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int [1:4521] 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int [1:4521] -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : num [1:4521] 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
All columns have been updated to the desired data type.
Checking the missing values
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
anyNA(bank)
## [1] FALSE
This is great, there is no missing values in the data.
Before we do the exploration, we will subset the data in which only
giving the desired columns and save it with Bank
variable.
The reason why I decided to not include the ‘day’ column is because this dataset didn’t provide more information about the year of the contact. Therefore we will do exploration according to the ‘month’ of the customer contact.
Bank <- bank[,c(1:9, 11, 13, 17)]
head(Bank)
## # A tibble: 6 × 12
## age job marital education default balance housing loan contact month
## <dbl> <fct> <fct> <fct> <fct> <int> <fct> <fct> <fct> <fct>
## 1 30 unemployed married primary no 1787 no no cellul… oct
## 2 33 services married secondary no 4789 yes yes cellul… may
## 3 35 management single tertiary no 1350 yes no cellul… apr
## 4 30 management married tertiary no 1476 yes yes unknown jun
## 5 59 blue-coll… married secondary no 0 yes no unknown may
## 6 35 management single tertiary no 747 no no cellul… feb
## # ℹ 2 more variables: campaign <int>, y <fct>
summary(Bank)
## age job marital education default
## Min. :19.00 management :969 divorced: 528 primary : 678 no :4445
## 1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306 yes: 76
## Median :39.00 technician :768 single :1196 tertiary :1350
## Mean :41.17 admin. :478 unknown : 187
## 3rd Qu.:49.00 services :417
## Max. :87.00 retired :230
## (Other) :713
## balance housing loan contact month
## Min. :-3313 no :1962 no :3830 cellular :2896 may :1398
## 1st Qu.: 69 yes:2559 yes: 691 telephone: 301 jul : 706
## Median : 444 unknown :1324 aug : 633
## Mean : 1423 jun : 531
## 3rd Qu.: 1480 nov : 389
## Max. :71188 apr : 293
## (Other): 571
## campaign y
## Min. : 1.000 no :4000
## 1st Qu.: 1.000 yes: 521
## Median : 2.000
## Mean : 2.794
## 3rd Qu.: 3.000
## Max. :50.000
##
Columns description:
* ‘age’ : the customer age
* ‘job’ : type of the customer job
* ‘marital’ : customer marital status
* ‘education’ : customer education level
* ‘default’ : information if the customer has a credit in default
* ‘balance’ : the average yearly balance
* ‘housing’ : information if the customer has a housing loan
* ‘loan’ : information if the customer has a personal loan
* ‘contact’ : the contact communication type
* ‘month’ : last contact month of year
* ‘campaign’ : number of contacts performed during this campaign and for
this client
* ‘y’ : the target (information if the customer has decided to subscribe
or not)
We will see if there is a correlation between each numeric column to
the target y
library(reshape2)
library(ggplot2)
library(gcookbook)
library(magrittr)
library(dplyr)
numeric_vars <- names(Bank)[sapply(Bank, is.numeric)]
numeric_data <- Bank[, numeric_vars]
numeric_data$target_var <- ifelse(Bank$y == "yes", 1, 0)
cor_matrix <- cor(numeric_data)
cor_melted <- melt(cor_matrix)
ggplot(cor_melted, aes(Var1, Var2, fill = value))+
geom_tile()+
geom_text(aes(label = round(value, 2)), color = "black", size = 3, vjust = 0.5) +
scale_fill_gradient2(high = "darkcyan", midpoint = 0) +
theme_minimal() +
labs(x = "", y = "", title = "Heatmap of Numeric Variables with Target Variable") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(angle = 0, hjust = 1))
According to the heatmap above, there is no strong correlation between the numeric factors with the target.
Here we have the distribution of the target.
ggplot(Bank, aes(x = factor(y)))+
geom_bar(fill = "lightgreen")+
geom_text(stat = "count", aes(label = paste0(..count.., " (", scales::percent(..count../sum(..count..)), ")")), vjust = 1.5, colour = "black")+
labs(title = "Number of Subscription Target")
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The target is quite unequal, as shown by the bar chart above. As a result, we’ll make an effort to concentrate on the subscribers’ customers.
aggregate(age~y, Bank, mean)
## y age
## 1 no 40.99800
## 2 yes 42.49136
aggregate(age~y, Bank, sd)
## y age
## 1 no 10.18840
## 2 yes 13.11577
boxplot(Bank$age)
From the results above, we can see that there are possibilities of
outliers, but since the standard deviation is below the mean, we can say
that the age’s of each category of target(y) is quite homogenous.
aggregate(balance~y, Bank, mean)
## y balance
## 1 no 1403.212
## 2 yes 1571.956
aggregate(balance~y, Bank, sd)
## y balance
## 1 no 3075.349
## 2 yes 2444.399
boxplot(Bank$balance)
According to the results above, we can see that there outliers within the balance, other than that, the standard deviation within each category of y are bigger than the mean which also indicate that the range of data is quite heterogeneous.
First, we will set the order of the month.
month_order <- c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec")
Bank$month <- factor(Bank$month, levels = month_order)
Now we’ll see the number of contact that happened in each month.
ggplot(Bank, aes(x = month)) +
geom_bar(fill = "lightgreen") +
geom_text(stat = "count",
aes(label = paste0(scales::percent(..count../sum(..count..)))),
vjust = 0.1,
colour = "black") +
labs(title = "Amount of contact according by month")
We see that the month of May had the most contact while the month of December had the least. This will affect the goal subscription number.
The number of target subscriptions in each month will now be visible.
month_y <- Bank %>%
filter(y == "yes") %>%
group_by(month) %>%
summarise(percentage = (n()/sum(Bank$y == "yes"))*100)
month_y
## # A tibble: 12 × 2
## month percentage
## <fct> <dbl>
## 1 jan 3.07
## 2 feb 7.29
## 3 mar 4.03
## 4 apr 10.7
## 5 may 17.9
## 6 jun 10.6
## 7 jul 11.7
## 8 aug 15.2
## 9 sep 3.26
## 10 oct 7.10
## 11 nov 7.49
## 12 dec 1.73
ggplot(month_y, aes(x = month, y = percentage))+
geom_col(fill = "lightgreen")+
geom_text(aes(label = paste(round(percentage, 3), "%")),
vjust = 0.1,
colour = "black")+
labs(title = "Percentage of clients who decide to sign up for the deposit")
We are able to see that at least 17.85% of customers choose to subscribe in May. Only 1.73% of customers chose to subscribe in December, though.
What about the ages of the subscribed client? Here we’re going to divide the ages of the subscribed customer.
age_gap <- c(18, 20, 30, 40, 50, 60, 70, 80, 90, 100)
Bank$age <- cut(Bank$age, breaks = age_gap, labels = c("18-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81-90", "90+"))
age_y <- Bank %>%
filter(y == "yes") %>%
group_by(age) %>%
summarise(frequency = n())
age_y
## # A tibble: 8 × 2
## age frequency
## <fct> <int>
## 1 18-20 3
## 2 21-30 87
## 3 31-40 178
## 4 41-50 123
## 5 51-60 82
## 6 61-70 24
## 7 71-80 20
## 8 81-90 4
We can see from the graphic above that the majority of the customers who choose to subscribe are between the ages of 31 and 40. While the youngest deposit subscribers were two individuals between the ages of 18 and 20.
Since we are focusing on the subscribed clients, we’re about to get to know about their relevant information from their job and education
job_y <- Bank %>%
filter(y == "yes") %>%
group_by(job) %>%
summarise(freq = n())
job_y
## # A tibble: 12 × 2
## job freq
## <fct> <int>
## 1 admin. 58
## 2 blue-collar 69
## 3 entrepreneur 15
## 4 housemaid 14
## 5 management 131
## 6 retired 54
## 7 self-employed 20
## 8 services 38
## 9 student 19
## 10 technician 83
## 11 unemployed 13
## 12 unknown 7
From the table above, most of the subscribed clients are working in management filed.
education_y <- Bank %>%
filter(y == "yes") %>%
group_by(education) %>%
summarise(freq = n())
education_y
## # A tibble: 4 × 2
## education freq
## <fct> <int>
## 1 primary 64
## 2 secondary 245
## 3 tertiary 193
## 4 unknown 19
On the other hand, the majority of subscribers are on the secondary education track.
test <- filter(Bank, y == "yes")
test[test$balance == min(test$balance), ]
## # A tibble: 1 × 12
## age job marital education default balance housing loan contact month
## <fct> <fct> <fct> <fct> <fct> <int> <fct> <fct> <fct> <fct>
## 1 51-60 retired married secondary no -1206 yes no cellular jun
## # ℹ 2 more variables: campaign <int>, y <fct>
The lowest yearly balance came from a client who is already retired.
test %>%
group_by(contact) %>%
summarise(count = n())
## # A tibble: 3 × 2
## contact count
## <fct> <int>
## 1 cellular 416
## 2 telephone 44
## 3 unknown 61
It appears, that mostly subscribers was contacted via cellular
test %>%
group_by(contact) %>%
summarise(mean(campaign))
## # A tibble: 3 × 2
## contact `mean(campaign)`
## <fct> <dbl>
## 1 cellular 2.27
## 2 telephone 2.52
## 3 unknown 2.05
It seems that most subscribers were contacted 2 or 3 times to decide to subscribe.
xtabs(balance~housing +loan, test)
## loan
## housing no yes
## no 506263 12985
## yes 279648 20093
plot(xtabs(balance~housing+loan, test))
The majority of subscribers don’t seem to have personal loans or even
housing loans. While the minority on the other hand, have personal loan
but not a housing loan.
xtabs(balance~marital+default, test)
## default
## marital no yes
## divorced 113963 540
## married 455074 359
## single 249188 -135
plot(xtabs(balance~marital+default, test))
Now we found that most subscribers are married with no credit in default.
The EDA for this project ends there. In order to create a statistical model with a target of the customers subscription, it was necessary to obtain this project in order to examine the insights that the dataset bring. Other than that, it was done to demonstrate the writer’s EDA programming abilities.
Thank you:)