1.Explanation

The data is related with direct marketing campaigns of a Portuguese banking institution.We are trying to analyze clients of a Portuguese banking institution which we classified based on 16 categories. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

2.Input and Checking Data

bank <- read.csv("bank/bank-full.csv",sep = ";")
str(bank)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...
str(bank)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

3.Inspecting data

head(bank)
##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no
tail(bank)
##       age          job  marital education default balance housing loan
## 45206  25   technician   single secondary      no     505      no  yes
## 45207  51   technician  married  tertiary      no     825      no   no
## 45208  71      retired divorced   primary      no    1729      no   no
## 45209  72      retired  married secondary      no    5715      no   no
## 45210  57  blue-collar  married secondary      no     668      no   no
## 45211  37 entrepreneur  married secondary      no    2971      no   no
##         contact day month duration campaign pdays previous poutcome   y
## 45206  cellular  17   nov      386        2    -1        0  unknown yes
## 45207  cellular  17   nov      977        3    -1        0  unknown yes
## 45208  cellular  17   nov      456        2    -1        0  unknown yes
## 45209  cellular  17   nov     1127        5   184        3  success yes
## 45210 telephone  17   nov      508        4    -1        0  unknown  no
## 45211  cellular  17   nov      361        2   188       11    other  no
dim(bank)
## [1] 45211    17
names(bank)
##  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
##  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
## [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

From our inspection we can conclude:

-bank data contain 9994 of rows and 15 of columns

Each column name with it’s explanation:

1 - age

2 - job : type of job

3 - marital : marital status

4 - education

5 - default: has credit in default?

6 - balance

7 - housing: has housing loan?

8 - loan: has personal loan?

9 - contact: contact communication type

10 - month: last contact month of year

11 - day_of_week: last contact day of the week

12 - duration: last contact duration, in seconds

13 - campaign: number of contacts performed during this campaign and for this client

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign

15 - previous: number of contacts performed before this campaign and for this client

16 - poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)

17 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

4.Data cleansing & coertions

check data type for each column

str(bank)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

From this result, we find some of data type not in the corect type. we need to convert it into corect type (data coertion)

names <- c("job","marital","education","default","housing","loan","contact","month","day","poutcome","y")
bank[,names] <- lapply(bank[,names] , as.factor)

str(bank)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : Factor w/ 31 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Each of column had already changed into each own desired data type

check for missing values

anyNA(bank)
## [1] FALSE

Great!! No missing value

Now, the bank dataset is ready to be processed and analyzed

5.Data Explanation

Brief explanation

summary(bank)
##       age                 job           marital          education    
##  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
##  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
##  Median :39.00   technician :7597   single  :12790   tertiary :13301  
##  Mean   :40.94   admin.     :5171                    unknown  : 1857  
##  3rd Qu.:48.00   services   :4154                                     
##  Max.   :95.00   retired    :2264                                     
##                  (Other)    :6835                                     
##  default        balance       housing      loan            contact     
##  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
##  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
##              Median :   448                           unknown  :13020  
##              Mean   :  1362                                            
##              3rd Qu.:  1428                                            
##              Max.   :102127                                            
##                                                                        
##       day            month          duration         campaign     
##  20     : 2752   may    :13766   Min.   :   0.0   Min.   : 1.000  
##  18     : 2308   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
##  21     : 2026   aug    : 6247   Median : 180.0   Median : 2.000  
##  17     : 1939   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
##  6      : 1932   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
##  5      : 1910   apr    : 2932   Max.   :4918.0   Max.   :63.000  
##  (Other):32344   (Other): 6060                                    
##      pdays          previous           poutcome       y        
##  Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
##  1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
##  Median : -1.0   Median :  0.0000   success: 1511              
##  Mean   : 40.2   Mean   :  0.5803   unknown:36959              
##  3rd Qu.: -1.0   3rd Qu.:  0.0000                              
##  Max.   :871.0   Max.   :275.0000                              
## 
summary(bank$job)
##        admin.   blue-collar  entrepreneur     housemaid    management 
##          5171          9732          1487          1240          9458 
##       retired self-employed      services       student    technician 
##          2264          1579          4154           938          7597 
##    unemployed       unknown 
##          1303           288
summary(bank$education)
##   primary secondary  tertiary   unknown 
##      6851     23202     13301      1857
prop.table(table(bank$month)) * 100
## 
##        apr        aug        dec        feb        jan        jul        jun 
##  6.4851474 13.8174338  0.4733361  5.8591936  3.1032271 15.2507133 11.8134967 
##        mar        may        nov        oct        sep 
##  1.0550530 30.4483422  8.7810489  1.6323461  1.2806618

Summary:

1.The minimum age of the clients was 18 meanwhile the maximum age was 95

2.Most of the client’s jobs came from blue-collar and management field

3.around 51 percent of the clients were secondary education-background

4.44396 from 45211 clients had no credit in default but there were 7244 clients who had personal loans

5.approximately 30 percent of the clients were called on may

6.roughly there were more than 75 percent of the clients had never been contacted before

7.From the data we can see that only 5289 from 45211 decided to subscribe the term deposit

6.Data Manipulation and Transformation

#consider only the data which the clients subscribe
bank_yes <- bank[bank$y == "yes" , ]

1.What were the range of age did subscribe deposit the most

temp <- prop.table(table(bank_yes$age))*100
plot(temp,type = "o")

The plot is interesting because it looks like a hill.

As we can see from the plot that from 18 until 29 years old, the amount of clients were increasing and then the peak was from 30 to 35 years old,but then the trend was decreasing afterwards

  1. How many clients subscribe deposit from each type of job
table(bank_yes$job) 
## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##           631           708           123           109          1301 
##       retired self-employed      services       student    technician 
##           516           187           369           269           840 
##    unemployed       unknown 
##           202            34

Most subscribers came from management field which is 1301 clients and the least subscribers came from the unknown job which contributes 34 clients

3.From each educational status what were the percentage of them subscribed the term deposit

prop.table(table(bank$education,bank$y),margin = 1)
##            
##                     no        yes
##   primary   0.91373522 0.08626478
##   secondary 0.89440565 0.10559435
##   tertiary  0.84993610 0.15006390
##   unknown   0.86429725 0.13570275

From the prop table we could see that from each educational status, the amount that deposit was small, only from range 8% (primary sector) to 15%(tertiary sector)

4.Balance information of clients who subscribed the deposit

nrow(bank_yes)
## [1] 5289
summary(bank_yes$balance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3058     210     733    1804    2159   81204
boxplot(bank_yes$balance)

From 5289 data,the median was 752$ and the mean is 1828$ while the max balance was 81204$, the boxplot could tell us that there were many high outliers

5.What job had the most and least personal loan

temp<-bank
temp$loan <- sapply(X = as.character(temp$loan),FUN = switch,"no" = "0","yes" = "1")
temp$loan <- as.integer(temp$loan)
agg <- aggregate.data.frame(list(has_personal_loan = temp$loan),by = list(job = temp$job),FUN = sum)
agg <- agg[order(agg$has_personal_loan,decreasing = T),]
agg
##              job has_personal_loan
## 2    blue-collar              1684
## 10    technician              1309
## 5     management              1253
## 1         admin.               991
## 8       services               836
## 3   entrepreneur               356
## 6        retired               309
## 7  self-employed               229
## 4      housemaid               152
## 11    unemployed               109
## 9        student                12
## 12       unknown                 4

We could see from the data that the job that had the most personal loan came from blue-collar sector meanwhile the least persoal loan came from unemployed sector

6.Meanwhile what job had the most and least housing loan

temp<-bank
temp$housing <- sapply(X = as.character(temp$housing),FUN = switch,"no" = "0","yes" = "1")
temp$housing <- as.integer(temp$housing)
agg <- aggregate.data.frame(list(has_housing_loan = temp$housing),by = list(job = temp$job),FUN = sum)
agg <- agg[order(agg$has_housing_loan,decreasing = T),]
agg
##              job has_housing_loan
## 2    blue-collar             7048
## 5     management             4678
## 10    technician             4115
## 1         admin.             3182
## 8       services             2766
## 3   entrepreneur              869
## 7  self-employed              765
## 11    unemployed              543
## 6        retired              491
## 4      housemaid              398
## 9        student              249
## 12       unknown               26

The job that had the most housing loan was still blue-collar but the least housing loan was housemaid

7.Was there a big correlation between the one who had personal loan and housing loan

cor(x = bank$housing == "yes" ,y = bank$loan == "yes")
## [1] 0.04132287
nrow(bank[bank$housing == "yes" & bank$loan == "yes",])
## [1] 4367

The correlation of the same client had both personal loan and housing loan were so small. There were only 4367 from 45211 clients

8.What is the percentage of the clients who didn’t have housing loan neither personal loan subscribe the deposit over all of the client that subscribe the deposit

x <- nrow(bank_yes[bank_yes$housing == "no" & bank_yes$loan == "no",])
y <- nrow(bank_yes)
x/y * 100
## [1] 59.27396

59% who subscribe the deposit didn’t have housing loan neither personal loan

9.What is the mean and median last call duration before the client subscribe the deposit

durasi <- bank_yes$duration
#because the default is in seconds we convert first into minutes
durasi <- durasi / 60
hist(durasi) 

mean(durasi)
## [1] 8.95491
median(durasi)
## [1] 7.1

The mean of last call duration was 8.9 minutes while the median was 7.1 minutes

10.Simple statistics of the number of times did the bank need to call before the client subscribed

kampain <- bank_yes$campaign
fivenum(kampain)
## [1]  1  1  2  3 32
mean(kampain)
## [1] 2.141047
boxplot(kampain)

The mean was 2.14

The minimum and lower quartile was 1,the median was 2 , the upper quartile was 3 and the maximum was 32

The data from the fivenum and boxplot could tell us that there were some high outliers

11.We could see that the mean and the median, both were around at 2 times calling.From all clients that had been called 2 times,what was the percentage that they refused to subscribe the deposit

nrow(bank[bank$campaign == 2 & bank$y == "no",])/nrow(bank[bank$campaign == 2,]) * 100
## [1] 88.79648

The interesting part was 88% of clients that had been called twice refused to subscribe the deposit