library(DT)
library(graphics)

1 Introduction

Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. Understanding the data with basic statistics will give us a tremendous advantage in making the best business decision. We try to get insights from The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution.

1.1 The Source of The Data

To access the data, check the followinG URL https://archive.ics.uci.edu/ml/datasets/bank+marketing

2 Reading The Data

bank <- read.csv("bank-full.csv", header = TRUE, sep = ";")
rmarkdown::paged_table(bank)
dim(bank)
#> [1] 45211    17
names(bank)
#>  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
#>  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
#> [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

The dataset includes information as follow:

  • 1 - age (numeric)
  • 2 - job : type of job (categorical)
  • 3 - marital : marital status (categorical)
  • 4 - education (categorical)
  • 5 - default: has credit in default? (categorical)
  • 6 - housing: has housing loan? (categorical)
  • 7 - loan: has personal loan? (categorical)
  • 8 - contact: contact communication type (categorical)
  • 9 - month: last contact month of year (categorical)
  • 10 - day_of_week: last contact day of the week (categorical)
  • 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  • 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  • 14 - previous: number of contacts performed before this campaign and for this client (numeric)
  • 15 - poutcome: outcome of the previous marketing campaign (categorical)
  • 16 - balance: average yearly balance
  • 17 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

3 Data Wrangling

3.1 Checking Missing Value

We will see whether the data have missing value or not. Missing values are represented by the symbol NA (not available) in R.

colSums(is.na(bank))
#>       age       job   marital education   default   balance   housing      loan 
#>         0         0         0         0         0         0         0         0 
#>   contact       day     month  duration  campaign     pdays  previous  poutcome 
#>         0         0         0         0         0         0         0         0 
#>         y 
#>         0
anyNA(bank)
#> [1] FALSE

3.2 Converting Variables Type

We see no missing value in the data. The next step is viewing the metadata to check the type of each variable.

str(bank)
#> 'data.frame':    45211 obs. of  17 variables:
#>  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
#>  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
#>  $ marital  : chr  "married" "single" "married" "married" ...
#>  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
#>  $ default  : chr  "no" "no" "no" "no" ...
#>  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
#>  $ housing  : chr  "yes" "yes" "yes" "yes" ...
#>  $ loan     : chr  "no" "no" "yes" "no" ...
#>  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
#>  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
#>  $ month    : chr  "may" "may" "may" "may" ...
#>  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
#>  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#>  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
#>  $ y        : chr  "no" "no" "no" "no" ...

From this result, we find some of the data type not in the correct type. we need to convert it into the correct type (data coercion)

bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$month <- as.factor(bank$month)
bank$default <- as.factor(bank$default)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)

str(bank)
#> 'data.frame':    45211 obs. of  17 variables:
#>  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
#>  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
#>  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
#>  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
#>  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
#>  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
#>  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
#>  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
#>  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
#>  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
#>  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
#>  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#>  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
#>  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Each column already changed into the desired data type.

4 Data Analysis

nrow(bank)
#> [1] 45211
summary(bank)
#>       age                 job           marital          education    
#>  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
#>  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
#>  Median :39.00   technician :7597   single  :12790   tertiary :13301  
#>  Mean   :40.94   admin.     :5171                    unknown  : 1857  
#>  3rd Qu.:48.00   services   :4154                                     
#>  Max.   :95.00   retired    :2264                                     
#>                  (Other)    :6835                                     
#>  default        balance       housing      loan            contact     
#>  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
#>  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
#>              Median :   448                           unknown  :13020  
#>              Mean   :  1362                                            
#>              3rd Qu.:  1428                                            
#>              Max.   :102127                                            
#>                                                                        
#>       day            month          duration         campaign     
#>  Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
#>  1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
#>  Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
#>  Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
#>  3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
#>  Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
#>                  (Other): 6060                                    
#>      pdays          previous           poutcome       y        
#>  Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
#>  1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
#>  Median : -1.0   Median :  0.0000   success: 1511              
#>  Mean   : 40.2   Mean   :  0.5803   unknown:36959              
#>  3rd Qu.: -1.0   3rd Qu.:  0.0000                              
#>  Max.   :871.0   Max.   :275.0000                              
#> 

Based on the summary above, we find pieces of information:

  1. The number of subscriber to a term deposit with the answer no overcomes the yes answer.

We can see the distribution of the subscriber with their equal marital status and education.

bank_yes <- bank[bank$y == "yes",]
graphics::pie(xtabs(~ bank_yes$marital, bank_yes$y))

graphics::pie(xtabs(~ bank_yes$education, bank_yes$y))

xtabs(~ bank_yes$marital + bank_yes$education, bank_yes)
#>                 bank_yes$education
#> bank_yes$marital primary secondary tertiary unknown
#>         divorced     104       291      203      24
#>         married      396      1304      913     142
#>         single        91       855      880      86

The subscribers mostly come from married and secondary education status. Next we’ll find out the balance of the customes with their job.

blue_collar.mean <- (mean(bank_yes$balance[bank_yes$job == "blue-collar"]))
management.mean <- (mean(bank_yes$balance[bank_yes$job == "management"]))
technician.mean <- (mean(bank_yes$balance[bank_yes$job == "technician"]))
admin.mean <- (mean(bank_yes$balance[bank_yes$job == "admin."]))
services.mean <- (mean(bank_yes$balance[bank_yes$job == "services"]))
retired.mean <- (mean(bank_yes$balance[bank_yes$job == "retired"]))
other <- subset(bank_yes, job != "blue-collar" | job != "management"| job!= "technician" | job!= "admin."| job!= "services"| job != "retired")
other.mean <-  mean(other$balance)

job.mean <- cbind(job=c("blue-collar","management","technician","admin.","services", "retired", "other"))
job.mean <- cbind(job.mean,as.data.frame(c(blue_collar.mean,management.mean,technician.mean,admin.mean,services.mean, retired.mean, other.mean)))

names(job.mean)[2] <- paste("Mean")
job.mean
graphics::barplot(xtabs(Mean ~ job, job.mean), col= "Green")

Customers who retired have the highest average yearly balance with the average annual balance of 2690 Euro, with the following ranking from management in the second place and technician in the third place.

bank_yes$age.Range <-c("18-32", "33-39", "40-47", "48-95")[findInterval(as.numeric(as.character(bank_yes$age)) , c(18, 33, 40, 48, Inf) )]
xtabs(~ bank_yes$age.Range + bank_yes$job, bank_yes)
#>                   bank_yes$job
#> bank_yes$age.Range admin. blue-collar entrepreneur housemaid management retired
#>              18-32    193         218           18        10        372       0
#>              33-39    164         179           35        20        389       1
#>              40-47    131         169           30        17        227       4
#>              48-95    143         142           40        62        313     511
#>                   bank_yes$job
#> bank_yes$age.Range self-employed services student technician unemployed unknown
#>              18-32            76      125     247        256         52       5
#>              33-39            42      116      19        253         46       5
#>              40-47            33       62       3        153         45       6
#>              48-95            36       66       0        178         59      18
graphics::barplot(xtabs(~ age.Range + job, bank_yes), legend.text = TRUE)

table(droplevels(bank_yes$job),bank_yes$age.Range)
#>                
#>                 18-32 33-39 40-47 48-95
#>   admin.          193   164   131   143
#>   blue-collar     218   179   169   142
#>   entrepreneur     18    35    30    40
#>   housemaid        10    20    17    62
#>   management      372   389   227   313
#>   retired           0     1     4   511
#>   self-employed    76    42    33    36
#>   services        125   116    62    66
#>   student         247    19     3     0
#>   technician      256   253   153   178
#>   unemployed       52    46    45    59
#>   unknown           5     5     6    18

The Data show the customers with the management background are the most likely to subscribe to the marketing campaign. We also see here the customers with management background at the age interval between 33 - 39 are the highest subscribers.

crosstab <- xtabs(formula = duration ~ contact + job, data = bank_yes)
heatmap(crosstab, Rowv = NA, Colv = NA, cexCol = 0.8, scale = "row")

We see that the customer dialed through telephone and retired has the highest in the last contact duration. Interestingly, the blue-collar people have the highest last contact duration through an unknown source. Further investigation may appear regarding this insight.

graphics::pie(xtabs(~ bank_yes$loan + bank_yes$housing, bank_yes$y))

xtabs(~ bank_yes$loan + bank_yes$housing, bank_yes$y)
#>              bank_yes$housing
#> bank_yes$loan   no  yes
#>           no  3135 1670
#>           yes  219  265

The subscribers with no housing and personal loan will likely to accept the marketing campaign.

5 Business Reccomendation

We have a few insights regarding the data. We find that customers who will likely subscribe come from the person with married status = yes, has a secondary education, with a management job. The target customer may come from the age ranging from 33 - 39 and should be called by cellular. That customer may also have no loan to take this consideration.

As we know, the data provide much more insight that has been dug. We have not discovered other variables that may affect a customer’s decision to participate in the campaign. Further analysis should be done to take insight genuinely; for example, we may run a machine learning model for more careful consideration.