LBB: Programming For Data Science

Syahvan Alviansyah Diva Ritonga

2023-06-12

1. Introduction

In the business world, especially consumer data, it is essential to carry out an analysis of consumer behavior to know insights into consumer behavior so that it can accelerate the growth in the number of consumers. The data used is from a bank in Portugal, where the data is obtained to determine whether the consumer will make a loan. Data comes from https://archive.ics.uci.edu/ml/datasets/bank+marketing.

2. Data Processing

2.1 Import data

bank <-read.csv2("bank.csv")

2.2 View the first 10 rows

head(bank, 10)
##    age           job marital education default balance housing loan  contact
## 1   30    unemployed married   primary      no    1787      no   no cellular
## 2   33      services married secondary      no    4789     yes  yes cellular
## 3   35    management  single  tertiary      no    1350     yes   no cellular
## 4   30    management married  tertiary      no    1476     yes  yes  unknown
## 5   59   blue-collar married secondary      no       0     yes   no  unknown
## 6   35    management  single  tertiary      no     747      no   no cellular
## 7   36 self-employed married  tertiary      no     307     yes   no cellular
## 8   39    technician married secondary      no     147     yes   no cellular
## 9   41  entrepreneur married  tertiary      no     221     yes   no  unknown
## 10  43      services married   primary      no     -88     yes  yes cellular
##    day month duration campaign pdays previous poutcome  y
## 1   19   oct       79        1    -1        0  unknown no
## 2   11   may      220        1   339        4  failure no
## 3   16   apr      185        1   330        1  failure no
## 4    3   jun      199        4    -1        0  unknown no
## 5    5   may      226        1    -1        0  unknown no
## 6   23   feb      141        2   176        3  failure no
## 7   14   may      341        1   330        2    other no
## 8    6   may      151        2    -1        0  unknown no
## 9   14   may       57        2    -1        0  unknown no
## 10  17   apr      313        1   147        2  failure no

2.3 View the last 10 rows

tail(bank, 10)
##      age           job marital education default balance housing loan  contact
## 4512  46   blue-collar married secondary      no     668     yes   no  unknown
## 4513  40   blue-collar married secondary      no    1100     yes   no  unknown
## 4514  49   blue-collar married secondary      no     322      no   no cellular
## 4515  38   blue-collar married secondary      no    1205     yes   no cellular
## 4516  32      services  single secondary      no     473     yes   no cellular
## 4517  33      services married secondary      no    -333     yes   no cellular
## 4518  57 self-employed married  tertiary     yes   -3313     yes  yes  unknown
## 4519  57    technician married secondary      no     295      no   no cellular
## 4520  28   blue-collar married secondary      no    1137      no   no cellular
## 4521  44  entrepreneur  single  tertiary      no    1136     yes  yes cellular
##      day month duration campaign pdays previous poutcome   y
## 4512  15   may     1263        2    -1        0  unknown yes
## 4513  29   may      660        2    -1        0  unknown  no
## 4514  14   aug      356        2    -1        0  unknown  no
## 4515  20   apr       45        4   153        1  failure  no
## 4516   7   jul      624        5    -1        0  unknown  no
## 4517  30   jul      329        5    -1        0  unknown  no
## 4518   9   may      153        1    -1        0  unknown  no
## 4519  19   aug      151       11    -1        0  unknown  no
## 4520   6   feb      129        4   211        3    other  no
## 4521   3   apr      345        2   249        7    other  no

Explanation of each column in the bank data:

  1. age (numeric)
  2. job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”,“blue-collar”,“self-employed”,“retired”,“technician”,“services”)
  3. marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
  4. education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
  5. default: has credit in default? (binary: “yes”,“no”)
  6. balance: average yearly balance, in euros (numeric)
  7. housing: has housing loan? (binary: “yes”,“no”)
  8. loan: has personal loan? (binary: “yes”,“no”), related with the last contact of the current campaign:
  9. contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
  10. day: last contact day of the month (numeric)
  11. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
  12. duration: last contact duration, in seconds (numeric)
  13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  15. previous: number of contacts performed before this campaign and for this client (numeric)
  16. poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
  17. y - has the client subscribed a term deposit? (binary: “yes”,“no”)

2.4 Change the appropriate data type

names <- c('job' ,'marital', "education", "default", "housing", "loan", "contact", "month", "poutcome", "y")
bank[,names] <- lapply(bank[,names] , factor)
str(bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

2.5 Checks whether there is missing value in the data frame

is.null(bank) 
## [1] FALSE

2.6 View data dimensions

dim(bank)
## [1] 4521   17

From the steps taken, it can be seen that there are 4521 rows with 17 columns without any null data in the dataset.

3. Exploratory Data Analysis

Summary Data

summary(bank)
##       age                 job          marital         education    default   
##  Min.   :19.00   management :969   divorced: 528   primary  : 678   no :4445  
##  1st Qu.:33.00   blue-collar:946   married :2797   secondary:2306   yes:  76  
##  Median :39.00   technician :768   single  :1196   tertiary :1350             
##  Mean   :41.17   admin.     :478                   unknown  : 187             
##  3rd Qu.:49.00   services   :417                                              
##  Max.   :87.00   retired    :230                                              
##                  (Other)    :713                                              
##     balance      housing     loan           contact          day       
##  Min.   :-3313   no :1962   no :3830   cellular :2896   Min.   : 1.00  
##  1st Qu.:   69   yes:2559   yes: 691   telephone: 301   1st Qu.: 9.00  
##  Median :  444                         unknown  :1324   Median :16.00  
##  Mean   : 1423                                          Mean   :15.92  
##  3rd Qu.: 1480                                          3rd Qu.:21.00  
##  Max.   :71188                                          Max.   :31.00  
##                                                                        
##      month         duration       campaign          pdays       
##  may    :1398   Min.   :   4   Min.   : 1.000   Min.   : -1.00  
##  jul    : 706   1st Qu.: 104   1st Qu.: 1.000   1st Qu.: -1.00  
##  aug    : 633   Median : 185   Median : 2.000   Median : -1.00  
##  jun    : 531   Mean   : 264   Mean   : 2.794   Mean   : 39.77  
##  nov    : 389   3rd Qu.: 329   3rd Qu.: 3.000   3rd Qu.: -1.00  
##  apr    : 293   Max.   :3025   Max.   :50.000   Max.   :871.00  
##  (Other): 571                                                   
##     previous          poutcome      y       
##  Min.   : 0.0000   failure: 490   no :4000  
##  1st Qu.: 0.0000   other  : 197   yes: 521  
##  Median : 0.0000   success: 129             
##  Mean   : 0.5426   unknown:3705             
##  3rd Qu.: 0.0000                            
##  Max.   :25.0000                            
## 

Insight:

  • The minimum consumer age is 19 years and the maximum age is 87 years, with an average age of 41 years
  • Most consumers come from management, blue-collar and technician workers
  • The marital status of most consumers is from those who are married, single, then divorced
  • Most consumer education comes from secondary or junior high school
  • The average consumer cash per year is 1423 Euro
  • Of all the consumers, the largest number of homeowners was 2559 people and those who did not have a home either lived in their parent’s house or rented out was 1962
  • from the load column, it is concluded that most consumers do not/do not have loans
  • The most used contact by consumers is mobile
  • There is a minus balance

Amount of balance based on job & education

xtabs(balance~job+education,bank)
##                education
## job             primary secondary tertiary unknown
##   admin.           6640    498984    53718   27038
##   blue-collar    395647    559942    28626   42348
##   entrepreneur     9982     74018   188771    3610
##   housemaid      103005     56333    52636   21412
##   management     104731    145012  1397982   64429
##   retired        219568    219355    76779   17712
##   self-employed   22076     88506   142205    2024
##   services        27683    362595    30318   39754
##   student          3575     75690    22338   28078
##   technician      38895    599876   344274   39160
##   unemployed      22703     69711    39193    7839
##   unknown          2522      9832    19982   24729

How long it takes customers to subscribed a term deposit ?

bank_y <- (bank[bank$y == "yes",])
aggregate(x = duration ~ y + contact, data = bank_y, FUN = "mean")
##     y   contact duration
## 1 yes  cellular 513.4495
## 2 yes telephone 498.2273
## 3 yes   unknown 860.0328

Is duration correlated to customers subscribed a term deposit ?

#convert y column into numeric by where yes as 1 and no as 2
library(dplyr)
bank_y2 <- bank
bank_y2$y <- sapply(as.character(bank_y2$y), switch,
  "yes" = "1",
  "no" = "0"
)
bank_y2$y <- as.numeric(bank_y2$y)
bank_y2 <- select(bank_y2, c(duration,y))
head(bank_y)
##    age         job  marital education default balance housing loan   contact
## 14  20     student   single secondary      no     502      no   no  cellular
## 31  68     retired divorced secondary      no    4189      no   no telephone
## 34  32  management   single  tertiary      no    2536     yes   no  cellular
## 35  49  technician  married  tertiary      no    1235      no   no  cellular
## 37  78     retired divorced   primary      no     229      no   no telephone
## 38  32 blue-collar  married secondary      no    2089     yes   no  cellular
##    day month duration campaign pdays previous poutcome   y
## 14  30   apr      261        1    -1        0  unknown yes
## 31  14   jul      897        2    -1        0  unknown yes
## 34  26   aug      958        6    -1        0  unknown yes
## 35  13   aug      354        3    -1        0  unknown yes
## 37  22   oct       97        1    -1        0  unknown yes
## 38  14   nov      132        1    -1        0  unknown yes
cov(bank_y2$duration, bank_y2$y)
## [1] 33.28654

Insight: duration and customer decision to subscribe a term deposit has positive relationship

cor(bank_y2$duration, bank_y2$y)
## [1] 0.4011183

Insight: duration and customer decision to subscribe a term deposit has positive correlation

#Probability 
mean_dur <- mean(bank_y2$duration)
sd_dur <- sd(bank_y2$duration)

Z <- (433-mean_dur)/sd_dur
pnorm(Z, lower.tail = T)
## [1] 0.7423178

Insight: Probability of customers subscribed a term deposit is 74,23% based on average call duration.

What is the percentage of marketing campaign success?

prop.table(table(bank$poutcome))
## 
##    failure      other    success    unknown 
## 0.10838310 0.04357443 0.02853351 0.81950896

Insight: The percentage of success is only 0.028% and the percentage of failure is 0.108%

What is the percentage of marketing campaign results based on education?

prop.table(table(bank$education, bank$poutcome), margin = 2)
##            
##                failure      other    success    unknown
##   primary   0.12448980 0.11167513 0.11627907 0.15654521
##   secondary 0.51632653 0.51776650 0.46511628 0.51039136
##   tertiary  0.32040816 0.32994924 0.34108527 0.29257760
##   unknown   0.03877551 0.04060914 0.07751938 0.04048583

Insight: The most successful marketing campaigns were carried out for consumers with secondary education but secondary was also the number of marketing that failed the most.

What is the percentage of marketing campaign results based on job?

prop.table(xtabs(formula = balance ~ job + poutcome, data = bank), margin = 2)
##                poutcome
## job                 failure       other     success     unknown
##   admin.        0.080488710 0.087284986 0.156071801 0.089867731
##   blue-collar   0.150851805 0.100847763 0.103088192 0.167018790
##   entrepreneur  0.015238057 0.014973933 0.015524468 0.050255562
##   housemaid     0.059484264 0.006578267 0.019075531 0.035102233
##   management    0.333544697 0.341232481 0.140646747 0.257610333
##   retired       0.089162490 0.035699395 0.145183995 0.081476851
##   self-employed 0.014916668 0.091379476 0.010836110 0.042094174
##   services      0.080619003 0.085374936 0.042978598 0.070793998
##   student       0.024379651 0.102793447 0.026161750 0.014646838
##   technician    0.141986929 0.091960331 0.321007341 0.157297093
##   unemployed    0.008197281 0.034013848 0.015377335 0.023445463
##   unknown       0.001130445 0.007861137 0.004048132 0.010390933

Insight: The most successful marketing campaigns were conducted for consumers with technician jobs

Conslusion

From bank data in Portugal, there are 4521 customers and 17 variables related to customer behavior. The percentage of successful marketing campaigns that have been carried out is 0.028% with the most successful marketing campaigns being carried out for consumers with secondary education and technician jobs. Based on calculations, one of the factors that cause success is the duration of the call so by increasing engagement and call duration, customers will be more interested in subscribing to a term deposit.