Explanatory Data for Marketing of Deposit Products in Portuguese banking

1. Overview Data :“Bank-Additional-Full.csv”

Data “bank-additional-full.csv” is a deposit product marketing data which is based on the phone calls of each client at a Portuguese banking institution. Data “bank-additional-full.csv” is data on the marketing of deposit products based on phone calls of each client at a Portuguese banking institution. The data consists of 20 input variables containing the client’s identity. This data is used to predict which clients will buy deposit products when the bank calls them.

2. Data Inspection

Data inspection helps us determine that data import has been executed correctly, that variables are in same length (rows) and breadth (columns) and that variables (columns) are in the same format as expected. In the R program, there are several basic functions that are often used for data inspection, namely read.csv(), head(),tail(),dim(), names().

a.Use read.csv() for input and read data

That must be considered in inputting data is that “data.csv” (data used) is in the same folder as the R project.

bank.data<-read.csv2("bank-additional-full.csv")

b. Use head() and tail()

head(bank.data)
  age       job marital   education default housing loan   contact month
1  56 housemaid married    basic.4y      no      no   no telephone   may
2  57  services married high.school unknown      no   no telephone   may
3  37  services married high.school      no     yes   no telephone   may
  day_of_week duration campaign pdays previous    poutcome emp.var.rate
1         mon      261        1   999        0 nonexistent          1.1
2         mon      149        1   999        0 nonexistent          1.1
3         mon      226        1   999        0 nonexistent          1.1
  cons.price.idx cons.conf.idx euribor3m nr.employed  y
1         93.994         -36.4     4.857        5191 no
2         93.994         -36.4     4.857        5191 no
3         93.994         -36.4     4.857        5191 no
 [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
tail(bank.data)
      age         job marital           education default housing loan  contact
41183  29  unemployed  single            basic.4y      no     yes   no cellular
41184  73     retired married professional.course      no     yes   no cellular
41185  46 blue-collar married professional.course      no      no   no cellular
      month day_of_week duration campaign pdays previous    poutcome
41183   nov         fri      112        1     9        1     success
41184   nov         fri      334        1   999        0 nonexistent
41185   nov         fri      383        1   999        0 nonexistent
      emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed   y
41183         -1.1         94.767         -50.8     1.028      4963.6  no
41184         -1.1         94.767         -50.8     1.028      4963.6 yes
41185         -1.1         94.767         -50.8     1.028      4963.6  no
 [ reached 'max' / getOption("max.print") -- omitted 3 rows ]

c. Use dim() and names()

dim(bank.data)
[1] 41188    21
names(bank.data)
 [1] "age"            "job"            "marital"        "education"     
 [5] "default"        "housing"        "loan"           "contact"       
 [9] "month"          "day_of_week"    "duration"       "campaign"      
[13] "pdays"          "previous"       "poutcome"       "emp.var.rate"  
[17] "cons.price.idx" "cons.conf.idx"  "euribor3m"      "nr.employed"   
[21] "y"             

3. Data Cleansing

Check data type for each column

str(bank.data)
'data.frame':   41188 obs. of  21 variables:
 $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
 $ job           : chr  "housemaid" "services" "services" "admin." ...
 $ marital       : chr  "married" "married" "married" "married" ...
 $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
 $ default       : chr  "no" "unknown" "no" "no" ...
 $ housing       : chr  "no" "no" "yes" "no" ...
 $ loan          : chr  "no" "no" "no" "no" ...
 $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
 $ month         : chr  "may" "may" "may" "may" ...
 $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
 $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
 $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
 $ emp.var.rate  : chr  "1.1" "1.1" "1.1" "1.1" ...
 $ cons.price.idx: chr  "93.994" "93.994" "93.994" "93.994" ...
 $ cons.conf.idx : chr  "-36.4" "-36.4" "-36.4" "-36.4" ...
 $ euribor3m     : chr  "4.857" "4.857" "4.857" "4.857" ...
 $ nr.employed   : chr  "5191" "5191" "5191" "5191" ...
 $ y             : chr  "no" "no" "no" "no" ...

There are some data types that are still wrong, because we have to change them

bank.data$job<-as.factor(bank.data$job)
bank.data$marital<-as.factor(bank.data$marital)
bank.data$education<-as.factor(bank.data$education)
bank.data$default<-as.factor(bank.data$default)
bank.data$housing<-as.factor(bank.data$housing)
bank.data$loan<-as.factor(bank.data$loan)
bank.data$month<-as.factor(bank.data$month)
bank.data$day_of_week<-as.factor(bank.data$day_of_week)
bank.data$poutcome<-as.factor(bank.data$poutcome)
bank.data$emp.var.rate<-as.numeric(bank.data$emp.var.rate)
bank.data$cons.price.idx<-as.numeric(bank.data$cons.price.idx)
bank.data$cons.conf.idx<-as.numeric(bank.data$cons.conf.idx)
bank.data$euribor3m<-as.numeric(bank.data$euribor3m)
bank.data$nr.employed<-as.numeric(bank.data$nr.employed)
bank.data$duration<-as.integer(bank.data$duration)

str(bank.data)
'data.frame':   41188 obs. of  21 variables:
 $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
 $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
 $ cons.price.idx: num  94 94 94 94 94 ...
 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num  5191 5191 5191 5191 5191 ...
 $ y             : chr  "no" "no" "no" "no" ...

checking for missing value

colSums(is.na(bank.data))
           age            job        marital      education        default 
             0              0              0              0              0 
       housing           loan        contact          month    day_of_week 
             0              0              0              0              0 
      duration       campaign          pdays       previous       poutcome 
             0              0              0              0              0 
  emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
             0              0              0              0              0 
             y 
             0 
anyNA(bank.data)
[1] FALSE

4. Data Explanation

summary(bank.data)
      age                 job            marital     
 Min.   :17.00   admin.     :10422   divorced: 4612  
 1st Qu.:32.00   blue-collar: 9254   married :24928  
 Median :38.00   technician : 6743   single  :11568  
               education        default         housing           loan      
 university.degree  :12168   no     :32588   no     :18622   no     :33950  
 high.school        : 9515   unknown: 8597   unknown:  990   unknown:  990  
 basic.9y           : 6045   yes    :    3   yes    :21576   yes    : 6248  
   contact              month       day_of_week    duration     
 Length:41188       may    :13769   fri:7827    Min.   :   0.0  
 Class :character   jul    : 7174   mon:8514    1st Qu.: 102.0  
 Mode  :character   aug    : 6178   thu:8623    Median : 180.0  
    campaign          pdays          previous            poutcome    
 Min.   : 1.000   Min.   :  0.0   Min.   :0.000   failure    : 4252  
 1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   nonexistent:35563  
 Median : 2.000   Median :999.0   Median :0.000   success    : 1373  
  emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
 Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.634  
 1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
 Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
  nr.employed        y            
 Min.   :4964   Length:41188      
 1st Qu.:5099   Class :character  
 Median :5191   Mode  :character  
 [ reached getOption("max.print") -- omitted 4 rows ]

Insight from summary:

  1. Minimum age of the client is 17 years.
  2. Most of the clients have jobs as admin.
  3. Most of the clients had recent education at university degree.
  4. There are three clients who have credit defaults.
  5. Most of the clients have housing loan.
  6. Most of the clients do not have personal loans.
  7. Month / day client’s last contact is often in May / Thursday.
  8. The median duration of client telephone contact was three minutes.
  9. The number of contacts that were called during the marketing of 2000 contacts
  10. The result of the previous marketing was nonexsiten

Check the Outlier within duration

aggregate(duration~poutcome,bank.data,mean)
     poutcome duration
1     failure 248.1129
2 nonexistent 257.0822
3     success 320.9417
aggregate(duration~poutcome,bank.data,var)
     poutcome duration
1     failure 53471.00
2 nonexistent 68778.65
3     success 65403.03
aggregate(duration~poutcome,bank.data,sd)
     poutcome duration
1     failure 231.2380
2 nonexistent 262.2568
3     success 255.7402
boxplot(bank.data$duration)

From the results of the visualization above we find the possibility for outliers, but from our calculation the SD value is around 200 which is still tolerable, so the process can be continued.

5. Bussines Question

  1. Which is job gives the lowest duration number?
bank.data[bank.data$duration == 0,]
      age         job  marital         education default housing loan   contact
6252   39      admin.  married       high.school      no     yes   no telephone
23032  59  management  married university.degree      no     yes   no  cellular
28064  53 blue-collar divorced       high.school      no     yes   no  cellular
      month day_of_week duration campaign pdays previous    poutcome
6252    may         tue        0        4   999        0 nonexistent
23032   aug         tue        0       10   999        0 nonexistent
28064   apr         fri        0        3   999        0 nonexistent
      emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed  y
6252           1.1         93.994         -36.4     4.857      5191.0 no
23032          1.4         93.444         -36.1     4.965      5228.1 no
28064         -1.8         93.075         -47.1     1.479      5099.1 no
 [ reached 'max' / getOption("max.print") -- omitted 1 rows ]

Answer: The client jobs with the lowest call duration are admin, management, and blue collar

2. what jobs he called on that Friday in April?

Answer:

no2<-bank.data[bank.data$month=="apr"& bank.data$day_of_week=="fri",]
(table(no2$job))

       admin.   blue-collar  entrepreneur     housemaid    management 
          149           166            26            10            56 
      retired self-employed      services       student    technician 
           17            12            62            20            79 
   unemployed       unknown 
           12             1 
  1. In April, July, July, clients who do not have personal loan and job have the highest campaign mean?

    Answer:

no3<-bank.data[bank.data$month==c("apr","jun","jul"),]
dt<-aggregate(campaign~job+default,no2,mean)
dt[dt$default=="no",]
             job default campaign
1         admin.      no 1.870504
2    blue-collar      no 1.775000
3   entrepreneur      no 1.708333
4      housemaid      no 1.625000
5     management      no 1.688889
6        retired      no 1.785714
7  self-employed      no 1.750000
8       services      no 1.777778
9        student      no 2.529412
10    technician      no 1.638889
11    unemployed      no 1.800000
12       unknown      no 2.000000