The dataset to be worked with can be found in Bank Marketing

Explanation

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

Input Data

Bank Marketing provides many version of data, but the data that will be used in this project will be the ‘bank.xlsx’ version.

library(readxl)
bank <- read_excel("D:/WRITE IT RIGHT 2/bank-marketing/bank.xlsx")

Data Inspection

head(bank)

## # A tibble: 6 × 17
##     age job        marital education default balance housing loan  contact   day
##   <dbl> <chr>      <chr>   <chr>     <chr>     <dbl> <chr>   <chr> <chr>   <dbl>
## 1    30 unemployed married primary   no         1787 no      no    cellul…    19
## 2    33 services   married secondary no         4789 yes     yes   cellul…    11
## 3    35 management single  tertiary  no         1350 yes     no    cellul…    16
## 4    30 management married tertiary  no         1476 yes     yes   unknown     3
## 5    59 blue-coll… married secondary no            0 yes     no    unknown     5
## 6    35 management single  tertiary  no          747 no      no    cellul…    23
## # ℹ 7 more variables: month <chr>, duration <dbl>, campaign <dbl>, pdays <dbl>,
## #   previous <dbl>, poutcome <chr>, y <chr>

tail(bank)

## # A tibble: 6 × 17
##     age job        marital education default balance housing loan  contact   day
##   <dbl> <chr>      <chr>   <chr>     <chr>     <dbl> <chr>   <chr> <chr>   <dbl>
## 1    32 services   single  secondary no          473 yes     no    cellul…     7
## 2    33 services   married secondary no         -333 yes     no    cellul…    30
## 3    57 self-empl… married tertiary  yes       -3313 yes     yes   unknown     9
## 4    57 technician married secondary no          295 no      no    cellul…    19
## 5    28 blue-coll… married secondary no         1137 no      no    cellul…     6
## 6    44 entrepren… single  tertiary  no         1136 yes     yes   cellul…     3
## # ℹ 7 more variables: month <chr>, duration <dbl>, campaign <dbl>, pdays <dbl>,
## #   previous <dbl>, poutcome <chr>, y <chr>

dim(bank)

## [1] 4521   17

names(bank)

##  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
##  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
## [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

From the inspection, we can conclude:
* bank data contain 4521 rows with 17 columns
* Each columns name: “age”, “job”, “marital”, “education”, “default”, “balance”, “housing”, “loan”, “contact”, “day”, “month”, “duration”, “campaign”, “pdays”, “previous”, “poutcome”, and “y”

Data Cleansing & Coertions

Check data types of each columns

str(bank)

## tibble [4,521 × 17] (S3: tbl_df/tbl/data.frame)
##  $ age      : num [1:4521] 30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : chr [1:4521] "unemployed" "services" "management" "management" ...
##  $ marital  : chr [1:4521] "married" "married" "single" "married" ...
##  $ education: chr [1:4521] "primary" "secondary" "tertiary" "tertiary" ...
##  $ default  : chr [1:4521] "no" "no" "no" "no" ...
##  $ balance  : num [1:4521] 1787 4789 1350 1476 0 ...
##  $ housing  : chr [1:4521] "no" "yes" "yes" "yes" ...
##  $ loan     : chr [1:4521] "no" "yes" "no" "yes" ...
##  $ contact  : chr [1:4521] "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : num [1:4521] 19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr [1:4521] "oct" "may" "apr" "jun" ...
##  $ duration : num [1:4521] 79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : num [1:4521] 1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : num [1:4521] -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : num [1:4521] 0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr [1:4521] "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr [1:4521] "no" "no" "no" "no" ...

From this result, we find some of data type not in the corect type. we need to convert it into correct type (data coertion)

bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$balance <- as.integer(bank$balance)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$month <- as.factor(bank$month)
bank$campaign <- as.integer(bank$campaign)
bank$pdays <- as.integer(bank$pdays)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)

str(bank)

## tibble [4,521 × 17] (S3: tbl_df/tbl/data.frame)
##  $ age      : num [1:4521] 30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int [1:4521] 1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day      : num [1:4521] 19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration : num [1:4521] 79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int [1:4521] 1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int [1:4521] -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : num [1:4521] 0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

All columns have been updated to the desired data type.

Checking the missing values

colSums(is.na(bank))

##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

anyNA(bank)

## [1] FALSE

This is great, there is no missing values in the data.

Before we do the exploration, we will subset the data in which only giving the desired columns and save it with Bank variable.

The reason why I decided to not include the ‘day’ column is because this dataset didn’t provide more information about the year of the contact. Therefore we will do exploration according to the ‘month’ of the customer contact.

Bank <- bank[,c(1:9, 11, 13, 17)]
head(Bank)

## # A tibble: 6 × 12
##     age job        marital education default balance housing loan  contact month
##   <dbl> <fct>      <fct>   <fct>     <fct>     <int> <fct>   <fct> <fct>   <fct>
## 1    30 unemployed married primary   no         1787 no      no    cellul… oct  
## 2    33 services   married secondary no         4789 yes     yes   cellul… may  
## 3    35 management single  tertiary  no         1350 yes     no    cellul… apr  
## 4    30 management married tertiary  no         1476 yes     yes   unknown jun  
## 5    59 blue-coll… married secondary no            0 yes     no    unknown may  
## 6    35 management single  tertiary  no          747 no      no    cellul… feb  
## # ℹ 2 more variables: campaign <int>, y <fct>

Data Exploratory

Brief explanation

summary(Bank)

##       age                 job          marital         education    default   
##  Min.   :19.00   management :969   divorced: 528   primary  : 678   no :4445  
##  1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306   yes:  76  
##  Median :39.00   technician :768   single  :1196   tertiary :1350             
##  Mean   :41.17   admin.     :478                   unknown  : 187             
##  3rd Qu.:49.00   services   :417                                              
##  Max.   :87.00   retired    :230                                              
##                  (Other)    :713                                              
##     balance      housing     loan           contact         month     
##  Min.   :-3313   no :1962   no :3830   cellular :2896   may    :1398  
##  1st Qu.:   69   yes:2559   yes: 691   telephone: 301   jul    : 706  
##  Median :  444                         unknown  :1324   aug    : 633  
##  Mean   : 1423                                          jun    : 531  
##  3rd Qu.: 1480                                          nov    : 389  
##  Max.   :71188                                          apr    : 293  
##                                                         (Other): 571  
##     campaign        y       
##  Min.   : 1.000   no :4000  
##  1st Qu.: 1.000   yes: 521  
##  Median : 2.000             
##  Mean   : 2.794             
##  3rd Qu.: 3.000             
##  Max.   :50.000             
##

Columns description:
* ‘age’ : the customer age
* ‘job’ : type of the customer job
* ‘marital’ : customer marital status
* ‘education’ : customer education level
* ‘default’ : information if the customer has a credit in default
* ‘balance’ : the average yearly balance
* ‘housing’ : information if the customer has a housing loan
* ‘loan’ : information if the customer has a personal loan
* ‘contact’ : the contact communication type
* ‘month’ : last contact month of year
* ‘campaign’ : number of contacts performed during this campaign and for this client
* ‘y’ : the target (information if the customer has decided to subscribe or not)

Correlation

We will see if there is a correlation between each numeric column to the target y

library(reshape2)
library(ggplot2)
library(gcookbook)
library(magrittr)
library(dplyr)

numeric_vars <- names(Bank)[sapply(Bank, is.numeric)]
numeric_data <- Bank[, numeric_vars]

numeric_data$target_var <- ifelse(Bank$y == "yes", 1, 0)

cor_matrix <- cor(numeric_data)
cor_melted <- melt(cor_matrix)

ggplot(cor_melted, aes(Var1, Var2, fill = value))+
  geom_tile()+
  geom_text(aes(label = round(value, 2)), color = "black", size = 3, vjust = 0.5) +
  scale_fill_gradient2(high = "darkcyan", midpoint = 0) +
  theme_minimal() +
  labs(x = "", y = "", title = "Heatmap of Numeric Variables with Target Variable") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_text(angle = 0, hjust = 1))

According to the heatmap above, there is no strong correlation between the numeric factors with the target.

Obtained Insights

The Information of The Bank’s Target

Here we have the distribution of the target.

ggplot(Bank, aes(x = factor(y)))+
  geom_bar(fill = "lightgreen")+
  geom_text(stat = "count", aes(label = paste0(..count.., " (", scales::percent(..count../sum(..count..)), ")")), vjust = 1.5, colour = "black")+
  labs(title = "Number of Subscription Target")

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The target is quite unequal, as shown by the bar chart above. As a result, we’ll make an effort to concentrate on the subscribers’ customers.

Check The Outlier

Within The Age

aggregate(age~y, Bank, mean)

##     y      age
## 1  no 40.99800
## 2 yes 42.49136

aggregate(age~y, Bank, sd)

##     y      age
## 1  no 10.18840
## 2 yes 13.11577

boxplot(Bank$age)

From the results above, we can see that there are possibilities of outliers, but since the standard deviation is below the mean, we can say that the age’s of each category of target(y) is quite homogenous.

Within The Balance

aggregate(balance~y, Bank, mean)

##     y  balance
## 1  no 1403.212
## 2 yes 1571.956

aggregate(balance~y, Bank, sd)

##     y  balance
## 1  no 3075.349
## 2 yes 2444.399

boxplot(Bank$balance)

According to the results above, we can see that there outliers within the balance, other than that, the standard deviation within each category of y are bigger than the mean which also indicate that the range of data is quite heterogeneous.

Informations According To The Month

First, we will set the order of the month.

month_order <- c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec")

Bank$month <- factor(Bank$month, levels = month_order)

Now we’ll see the number of contact that happened in each month.

ggplot(Bank, aes(x = month)) +
  geom_bar(fill = "lightgreen") +
  geom_text(stat = "count", 
            aes(label = paste0(scales::percent(..count../sum(..count..)))), 
            vjust = 0.1, 
            colour = "black") +
  labs(title = "Amount of contact according by month")

We see that the month of May had the most contact while the month of December had the least. This will affect the goal subscription number.

The number of target subscriptions in each month will now be visible.

month_y <- Bank %>%
  filter(y == "yes") %>%
  group_by(month) %>%
  summarise(percentage = (n()/sum(Bank$y == "yes"))*100)

month_y

## # A tibble: 12 × 2
##    month percentage
##    <fct>      <dbl>
##  1 jan         3.07
##  2 feb         7.29
##  3 mar         4.03
##  4 apr        10.7 
##  5 may        17.9 
##  6 jun        10.6 
##  7 jul        11.7 
##  8 aug        15.2 
##  9 sep         3.26
## 10 oct         7.10
## 11 nov         7.49
## 12 dec         1.73

ggplot(month_y, aes(x = month, y = percentage))+
  geom_col(fill = "lightgreen")+
  geom_text(aes(label = paste(round(percentage, 3), "%")),
            vjust = 0.1, 
            colour = "black")+
  labs(title = "Percentage of clients who decide to sign up for the deposit")

We are able to see that at least 17.85% of customers choose to subscribe in May. Only 1.73% of customers chose to subscribe in December, though.

Information According To The Group of The Age

What about the ages of the subscribed client? Here we’re going to divide the ages of the subscribed customer.

age_gap <- c(18, 20, 30, 40, 50, 60, 70, 80, 90, 100)
Bank$age <- cut(Bank$age, breaks = age_gap, labels = c("18-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81-90", "90+"))

age_y <- Bank %>%
  filter(y == "yes") %>%
  group_by(age) %>%
  summarise(frequency = n())

age_y

## # A tibble: 8 × 2
##   age   frequency
##   <fct>     <int>
## 1 18-20         3
## 2 21-30        87
## 3 31-40       178
## 4 41-50       123
## 5 51-60        82
## 6 61-70        24
## 7 71-80        20
## 8 81-90         4

We can see from the graphic above that the majority of the customers who choose to subscribe are between the ages of 31 and 40. While the youngest deposit subscribers were two individuals between the ages of 18 and 20.

Information According to The Job and Education

Since we are focusing on the subscribed clients, we’re about to get to know about their relevant information from their job and education

job_y <- Bank %>%
  filter(y == "yes") %>%
  group_by(job) %>%
  summarise(freq = n())

job_y

## # A tibble: 12 × 2
##    job            freq
##    <fct>         <int>
##  1 admin.           58
##  2 blue-collar      69
##  3 entrepreneur     15
##  4 housemaid        14
##  5 management      131
##  6 retired          54
##  7 self-employed    20
##  8 services         38
##  9 student          19
## 10 technician       83
## 11 unemployed       13
## 12 unknown           7

From the table above, most of the subscribed clients are working in management filed.

education_y <- Bank %>%
  filter(y == "yes") %>%
  group_by(education) %>%
  summarise(freq = n())

education_y

## # A tibble: 4 × 2
##   education  freq
##   <fct>     <int>
## 1 primary      64
## 2 secondary   245
## 3 tertiary    193
## 4 unknown      19

On the other hand, the majority of subscribers are on the secondary education track.

Business Question

Which job gives the lowest yearly balance to subscribe the deposit?

test <- filter(Bank, y == "yes")
test[test$balance == min(test$balance), ]

## # A tibble: 1 × 12
##   age   job     marital education default balance housing loan  contact  month
##   <fct> <fct>   <fct>   <fct>     <fct>     <int> <fct>   <fct> <fct>    <fct>
## 1 51-60 retired married secondary no        -1206 yes     no    cellular jun  
## # ℹ 2 more variables: campaign <int>, y <fct>

The lowest yearly balance came from a client who is already retired.

Which way of contact that has obtained many subscribers?

test %>%
  group_by(contact) %>%
  summarise(count = n())

## # A tibble: 3 × 2
##   contact   count
##   <fct>     <int>
## 1 cellular    416
## 2 telephone    44
## 3 unknown      61

It appears, that mostly subscribers was contacted via cellular

How many contacts that most needed in order to get subscriber ?

test %>%
  group_by(contact) %>%
  summarise(mean(campaign))

## # A tibble: 3 × 2
##   contact   `mean(campaign)`
##   <fct>                <dbl>
## 1 cellular              2.27
## 2 telephone             2.52
## 3 unknown               2.05

It seems that most subscribers were contacted 2 or 3 times to decide to subscribe.

How much yearly balance from each category of housing and loan?

xtabs(balance~housing +loan, test)

##        loan
## housing     no    yes
##     no  506263  12985
##     yes 279648  20093

plot(xtabs(balance~housing+loan, test))

The majority of subscribers don’t seem to have personal loans or even housing loans. While the minority on the other hand, have personal loan but not a housing loan.

How much the yearly balance from each category of marital and credit default?

xtabs(balance~marital+default, test)

##           default
## marital        no    yes
##   divorced 113963    540
##   married  455074    359
##   single   249188   -135

plot(xtabs(balance~marital+default, test))

Now we found that most subscribers are married with no credit in default.

Closing

The EDA for this project ends there. In order to create a statistical model with a target of the customers subscription, it was necessary to obtain this project in order to examine the insights that the dataset bring. Other than that, it was done to demonstrate the writer’s EDA programming abilities.

Thank you:)

EDA: Portuguese Bank Marketing

Julia Oriana

2024-02-16