Bank Marketing Analysis

Reynaldi Gevin

14/4/2022

1. About Dataset

The data is related to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to or not.


2. Read Data

Let’s read the data and understand the meaning of each attribute in the dataset:

Load Data

library(dplyr)  # %>%, mutate_at(), summarise_all()
library(DT)     # datatable()
bank <- read.csv('Data Input/Bank Marketing/bank-full.csv', sep = ';')
datatable(bank, rownames = FALSE, filter = "top", options = list(pageLength = 5, scrollX = TRUE))
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Meta Data

Input variables in bank:

  1. Client Data
    • age (numeric)
    • job : type of job (categorical):
      • “admin.”
      • “unknown”
      • “unemployed”
      • “management”
      • “housemaid”
      • “entrepreneur”
      • “student”
      • “blue-collar”
      • “self-employed”
      • “retired”
      • “technician”
      • “services”
    • marital : marital status (categorical):
      • “married”
      • “divorced”
      • “single”
      (note: “divorced” means divorced or widowed)
    • education (categorical):
      • “unknown”
      • “secondary”
      • “primary”
      • “tertiary”
    • default: has credit in default? (binary: “yes”,“no”)
    • balance: average yearly balance, in euros (numeric)
    • housing: has housing loan? (binary: “yes”,“no”)
    • loan: has personal loan? (binary: “yes”,“no”)
  2. Related to the last contact of the current campaign:
    • contact: contact communication type (categorical):
      • “unknown”
      • “telephone”
      • “cellular”
    • day: last contact day of the month (numeric)
    • month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
    • duration: last contact duration, in seconds (numeric)
  3. Other attributes:
    • campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
    • pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
    • previous: number of contacts performed before this campaign and for this client (numeric)
    • poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
    • y: has the client subscribed to a term deposit? (binary: “yes”,“no”); this is the target variable

3. Data Cleansing

Let’s find out more about our data

3.1. Data Wrangling

# Data structure check
str(bank)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

Several attributes still have an unsuitable data type. For example, binary attributes such as default, housing, loan, and y are still stored as character. Let’s check whether the other data types are correct.

To decide which columns should become factors, we count the unique values in each column.

# Check and count unique value
sapply(bank, function(x) length(unique(x)))
##       age       job   marital education   default   balance   housing      loan 
##        77        12         3         4         2      7168         2         2 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         3        31        12      1573        48       559        41         4 
##         y 
##         2

Several character attributes have only a handful of unique values compared with the 45211 observations, so they are categorical and should be converted to factors.

bank <- bank %>% 
  mutate_at(vars(job, marital, education, default, housing, loan, contact, poutcome, month, y), as.factor)
str(bank)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
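Note that as.factor() orders levels alphabetically, which is why month starts with "apr", "aug", "dec" in the output above. If calendar order matters for later tables or plots, the levels can be set explicitly. The following is only a minimal sketch on a small made-up vector (months_raw is illustrative and not taken from the dataset); it relies on R's built-in month.abb constant.

```r
# Hypothetical sample of month values in the same lowercase format as the data
months_raw <- c("may", "jan", "dec", "may", "aug")

# tolower(month.abb) yields "jan", "feb", ..., "dec" in calendar order
month_levels <- tolower(month.abb)

# Passing levels explicitly overrides the default alphabetical ordering
months_fct <- factor(months_raw, levels = month_levels)

levels(months_fct)[1:3]
table(months_fct)[c("jan", "may", "dec")]
```

Applied to the real data, the same idea would be bank$month <- factor(bank$month, levels = tolower(month.abb)), assuming the month values are spelled exactly as in the metadata.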

3.2. Missing Value check

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

Based on these results, there are no NA values in our data.

However, missing values may be encoded in another form. Let’s count how often the word “unknown” appears in each column.

bank %>%
  summarise_all(list(~sum(. == "unknown")))
##   age job marital education default balance housing loan contact day month
## 1   0 288       0      1857       0       0       0    0   13020   0     0
##   duration campaign pdays previous poutcome y
## 1        0        0     0        0    36959 0

It turns out that missing values are indeed encoded in another way in our dataset, namely as “unknown”: in job (288), education (1857), contact (13020), and poutcome (36959).
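If these values should be treated as true missing values in a later step, the string "unknown" can be converted to NA. The sketch below uses a small made-up data frame (toy and its rows are hypothetical) rather than the real data. Whether to convert at all is a judgment call, since keeping "unknown" as its own category is also a valid choice.

```r
# Made-up rows that mimic the structure of the bank data
toy <- data.frame(
  job      = c("management", "unknown", "student"),
  poutcome = c("unknown", "unknown", "success"),
  age      = c(58, 44, 33),
  stringsAsFactors = FALSE
)

# Replace the literal string "unknown" with NA in character columns only
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(x) replace(x, x == "unknown", NA))

# Count NA values per column after the conversion
colSums(is.na(toy))
```

An alternative is to pass na.strings = "unknown" to read.csv() so the conversion happens at load time.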

# Frequency target variable
table(bank$y)
## 
##    no   yes 
## 39922  5289
# Frequency poutcome vs target variable
table(bank$y, bank$poutcome)
##      
##       failure other success unknown
##   no     4283  1533     533   33573
##   yes     618   307     978    3386

However, we cannot simply delete the rows that contain “unknown”: the rows where poutcome is “unknown” alone account for 3386 of the 5289 “yes” cases in the target variable.

3.3. Duplicate Check

# The comma selects rows; without it, the logical vector would index columns
nrow(bank[duplicated(bank), ])
## [1] 0

There are no duplicate rows in the dataset.

4. Data Understanding

4.1. Data dimension check

dim(bank)
## [1] 45211    17

The data we use has dimensions of 45211 x 17, where there are 45211 observations (rows) and 17 variables (columns).

4.2. Summary check

summary(bank)
##       age                 job           marital          education    
##  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
##  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
##  Median :39.00   technician :7597   single  :12790   tertiary :13301  
##  Mean   :40.94   admin.     :5171                    unknown  : 1857  
##  3rd Qu.:48.00   services   :4154                                     
##  Max.   :95.00   retired    :2264                                     
##                  (Other)    :6835                                     
##  default        balance       housing      loan            contact     
##  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
##  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
##              Median :   448                           unknown  :13020  
##              Mean   :  1362                                            
##              3rd Qu.:  1428                                            
##              Max.   :102127                                            
##                                                                        
##       day            month          duration         campaign     
##  Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
##  1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
##  Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
##  Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
##  3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
##  Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
##                  (Other): 6060                                    
##      pdays          previous           poutcome       y        
##  Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
##  1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
##  Median : -1.0   Median :  0.0000   success: 1511              
##  Mean   : 40.2   Mean   :  0.5803   unknown:36959              
##  3rd Qu.: -1.0   3rd Qu.:  0.0000                              
##  Max.   :871.0   Max.   :275.0000                              
## 

Summary :

  • The age variable is roughly symmetric around its center, as its median (39) and mean (40.94) are close
  • The youngest client is 18 years old and the oldest is 95 years old
  • In contrast to age, the mean of the balance variable (1362) is much larger than its median (448), which indicates right skewness
  • The target is imbalanced: far fewer clients subscribed (5289) than did not (39922)
  • The previous variable is extremely right-skewed: its third quartile is 0, while its maximum is 275
  • Telephone is the least common contact type (2906), compared with cellular (29285) and unknown (13020)
  • Far more clients have no credit in default (44396) than have credit in default (815)
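The imbalance in the target can be quantified directly from the counts reported by table(bank$y). This is a small sketch that re-enters those two counts by hand (y_counts is an illustrative name), so it runs without the dataset.

```r
# Counts copied from the table(bank$y) output in section 3.2
y_counts <- c(no = 39922, yes = 5289)

# Share of each class in the target variable
y_share <- prop.table(y_counts)
round(100 * y_share, 1)

# Ratio of non-subscribers to subscribers
unname(y_counts["no"] / y_counts["yes"])
```

Only about 11.7% of clients subscribed, which is worth keeping in mind if a classifier is trained on this data later, because plain accuracy can be misleading on imbalanced targets.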

5. Exploratory Data Analysis

Let’s answer some questions about the dataset!

  1. How old are the people who subscribe to a term deposit? Show the age distribution!

Based on the age distribution plot, most of the bank’s clients are in the 25-40 age range. The bank does not seem very interested in contacting the older population, even though beyond the 60-year threshold the relative frequency is higher when y = “yes”. In other words, elderly clients appear more likely to subscribe to a term deposit.
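The relative frequency of y = "yes" per age group can also be computed numerically. The sketch below is one possible way to do it: subscription_rate_by_age is a hypothetical helper, the age bands are an arbitrary choice, and the example values are made up rather than taken from the dataset.

```r
# Share of "yes" answers per age band; `age` is numeric, `y` holds "yes"/"no"
subscription_rate_by_age <- function(age, y, breaks = c(0, 25, 40, 60, Inf)) {
  # right = FALSE makes the bands left-closed, e.g. [25,40)
  band <- cut(age, breaks = breaks, right = FALSE)
  # The mean of a logical vector is the proportion of TRUE values
  tapply(y == "yes", band, mean)
}

# Made-up example values, not taken from the dataset
toy_age <- c(22, 30, 35, 45, 61, 70, 75, 38)
toy_y   <- c("no", "no", "yes", "no", "yes", "yes", "no", "no")

subscription_rate_by_age(toy_age, toy_y)
```

On the real data the call would be subscription_rate_by_age(bank$age, bank$y); the comparison with "yes" also works when y is a factor.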

  2. What jobs do the bank’s clients have?
table(bank$job, bank$y)
##                
##                   no  yes
##   admin.        4540  631
##   blue-collar   9024  708
##   entrepreneur  1364  123
##   housemaid     1131  109
##   management    8157 1301
##   retired       1748  516
##   self-employed 1392  187
##   services      3785  369
##   student        669  269
##   technician    6757  840
##   unemployed    1101  202
##   unknown        254   34

In absolute numbers, clients with management jobs subscribe to the bank’s product most often (1301), and, leaving aside the “unknown” category (34), housemaids subscribe least often (109).

One possible explanation is that clients in management are more financially aware and care more about future investments, while housemaids may be less familiar with investment products. Note that these are counts, not rates: management is also one of the largest groups in the dataset.
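Because the job groups differ a lot in size, it can help to look at the share of subscribers within each job in addition to the raw counts. The sketch below re-enters the counts from the table above by hand (job_counts is an illustrative name) and uses prop.table() with margin = 1 to obtain row-wise proportions.

```r
# Counts copied from the table(bank$job, bank$y) output above
job_counts <- matrix(
  c(4540, 631, 9024, 708, 1364, 123, 1131, 109, 8157, 1301, 1748, 516,
    1392, 187, 3785, 369, 669, 269, 6757, 840, 1101, 202, 254, 34),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("admin.", "blue-collar", "entrepreneur", "housemaid", "management",
      "retired", "self-employed", "services", "student", "technician",
      "unemployed", "unknown"),
    c("no", "yes")
  )
)

# margin = 1 divides each row by its row total, giving per-job shares
job_rates <- prop.table(job_counts, margin = 1)

# Jobs sorted by share of subscribers, in percent
round(100 * sort(job_rates[, "yes"], decreasing = TRUE), 1)
```

By this measure, students and retired clients have the highest subscription rates, even though management has the highest count. With the data loaded, prop.table(table(bank$job, bank$y), margin = 1) gives the same proportions.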

  3. Does marital status affect subscription?
plot(table(bank$marital, bank$y))

table(bank$marital, bank$y)
##           
##               no   yes
##   divorced  4585   622
##   married  24459  2755
##   single   10878  1912

Based on the plot above, marital status does not appear to strongly affect whether a client subscribes to the product: in every group, clients who did not subscribe far outnumber those who did.
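To put numbers on this, one option is to compare the subscription share per marital status and, as a rough formal check, run a chi-squared test of independence. The sketch below re-enters the counts from the table above (marital_counts is an illustrative name); it is not part of the original analysis.

```r
# Counts copied from the table(bank$marital, bank$y) output above
marital_counts <- matrix(
  c(4585, 622, 24459, 2755, 10878, 1912),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("divorced", "married", "single"), c("no", "yes"))
)

# Share of subscribers within each marital status, in percent
marital_rates <- prop.table(marital_counts, margin = 1)
round(100 * marital_rates[, "yes"], 1)

# Pearson chi-squared test of independence between marital status and y
marital_test <- chisq.test(marital_counts)
marital_test$p.value
```

With a sample this large, even small differences tend to come out as statistically significant, so the size of the differences in the shares is the more informative quantity here.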

  4. How about the outcome of the previous marketing campaign?
table(bank$poutcome, bank$y)
##          
##              no   yes
##   failure  4283   618
##   other    1533   307
##   success   533   978
##   unknown 33573  3386

Clients for whom the previous campaign was a success tend to subscribe again: 978 of the 1511 clients with a “success” outcome said yes. Even clients whose previous outcome was “failure” subscribe more often (618 of 4901) than clients with an “unknown” previous outcome (3386 of 36959). So even if the previous campaign was a failure, recontacting people seems important.


6. Conclusion

So, how can the bank increase the likelihood that clients will buy the products it offers?

Based on the analysis we have done, there are several ways to increase the number of clients who buy the products on offer.

  1. The bank should promote its products more to clients aged 60 and over, because the exploration showed that clients in this age group subscribe to a term deposit at a higher relative frequency.

  2. Before promoting its products, the bank should also run outreach on the importance of investing to raise financial awareness across all parts of society, so that people understand and are familiar with the products the bank offers.

  3. It is also important to keep promoting to every client, even those who have subscribed to the product before. Repeated promotion is likely to increase the number of people who subscribe to the products the bank offers.