Introduction

The objective of this project was to build classifiers to predict whether a person from variety walks of life would subscribe yes or no to a term deposit. The data I was using is from UCI website( Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing). This project has two phases. Phase I focuses on data preprocessing and exploration, as covered in this report. We shall present model building in Phase II. The rest of this report is organised as follow. Section 2 describes the data sets and their attributes. Section 3 covers data pre-processing. In Section 4, we explore each attribute and their inter-relationships. The last section ends with a summary.

Data Set

The UCI Machine Learning Repository provides two zip files, but only bank.zip, was useful in this project. bank.csv and bank.names are the full and description data sets respectively. bank-names contains the details of attributes or variables. The full data set has 45211 observations. I will take 80% of the sample with all attributes as trainning data and the rest is to test model. In Phase II, we would build the classifiers from the full the data set and evaluate their performance using cross-validation.

Target Feature

The response feature is income which is given

The target feature has two classes and hence it is a binary classification problem. To reiterate, The goal is to predict whether a person subcribe a deposit.

Descriptive Features

The variable description is produced here from bank.name file:

  • age: numeric
  • workclass: ype of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”, “student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)
  • marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
  • education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
  • default: has credit in default? (binary: “yes”,“no”)
  • balance: average yearly balance, in euros (numeric)
  • housing: has housing loan? (binary: “yes”,“no”)
  • loan: has personal loan? (binary: “yes”,“no”)
  • contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
  • sex: Female, Male.day: last contact day of the month (numeric)
  • month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
  • duration: last contact duration, in seconds (numeric)
  • campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  • previous: number of contacts performed before this campaign and for this client (numeric)
  • poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

Data Pre-processing

Preliminaries

In this project, we used the following R packages.

bank <- read.csv("~/Downloads/bank.csv", stringsAsFactors = FALSE,
                 header = TRUE, sep = ';')

Data Cleaning

str(bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : chr  "unemployed" "services" "management" "management" ...
##  $ marital  : chr  "married" "married" "single" "married" ...
##  $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : chr  "no" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "yes" "no" "yes" ...
##  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr  "oct" "may" "apr" "jun" ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...
summarizeColumns(bank) %>% knitr::kable( caption = 'Feature Summary before Data Preprocessing')
Feature Summary before Data Preprocessing
name type na mean disp median mad min max nlevs
age integer 0 41.1700951 10.5762110 39 10.3782 19 87 0
job character 0 NA 0.7856669 NA NA 38 969 12
marital character 0 NA 0.3813316 NA NA 528 2797 3
education character 0 NA 0.4899359 NA NA 187 2306 4
default character 0 NA 0.0168104 NA NA 76 4445 2
balance integer 0 1422.6578191 3009.6381425 444 658.2744 -3313 71188 0
housing character 0 NA 0.4339748 NA NA 1962 2559 2
loan character 0 NA 0.1528423 NA NA 691 3830 2
contact character 0 NA 0.3594338 NA NA 301 2896 3
day integer 0 15.9152842 8.2476673 16 10.3782 1 31 0
month character 0 NA 0.6907764 NA NA 20 1398 12
duration integer 0 263.9612917 259.8566326 185 143.8122 4 3025 0
campaign integer 0 2.7936297 3.1098067 2 1.4826 1 50 0
pdays integer 0 39.7666445 100.1211244 -1 0.0000 -1 871 0
previous integer 0 0.5425791 1.6935624 0 0.0000 0 25 0
poutcome character 0 NA 0.1804910 NA NA 129 3705 4
y character 0 NA 0.1152400 NA NA 521 4000 2

We removed the excessive white spaces for all character features.

bank[, sapply( bank, is.character )] <- sapply( bank[,sapply(bank, is.character )], trimws) 
## remove exessive white spaces
table(bank$y)
## 
##   no  yes 
## 4000  521

It seems to me that the origional data is pretty clean and has no missing variable. Now we are going to an interesting part: Visualisation.

Quick Visualization

  • Box plot for age:
boxplot( data= bank, bank$age, col = "darkblue") 

  • Distribution of age:
ggplot(bank, aes(x=age)) + 
  geom_histogram(aes(y=..density..), colour="grey", fill="lightblue", linetype="dashed")+
  geom_density(alpha=.2, fill="#FF6666")+ 
  geom_vline(aes(xintercept= mean(age)), color="blue", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • Box plot for duration:
boxplot( data= bank, bank$duration, col = "darkblue",boxwex = 0.75) 

  • Histogram for duration:
ggplot(bank, aes(x=duration)) + 
  geom_histogram(aes(y=..density..), colour="grey", fill="lightblue", linetype="dashed")+
  geom_density(alpha=.2)+ 
  geom_vline(aes(xintercept= mean(duration)), color="blue",
             linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • Histogram for duration with deposit
ggplot(bank, aes(x=duration, fill=y)) + geom_density(alpha=.3)

  • Explore the people who made a deposit and job category:
NROW(na.omit(bank))
## [1] 4521
bank %>%
  group_by(job) %>%
  summarise(
    count = sum(y == 'yes')) 
## # A tibble: 12 x 2
##    job           count
##    <chr>         <int>
##  1 admin.           58
##  2 blue-collar      69
##  3 entrepreneur     15
##  4 housemaid        14
##  5 management      131
##  6 retired          54
##  7 self-employed    20
##  8 services         38
##  9 student          19
## 10 technician       83
## 11 unemployed       13
## 12 unknown           7

From here, I would like to combine similar jobs into categories. In which: - admin. and management will be white-collar; - services and housemaid will be pink-collar; - retired and student, uneployed, unknown will be other;

bank$job <- as.character(bank$job)
bank$job[bank$job %in% c("admin.","management")] <- "white-collar"
bank$job[bank$job %in% c('services','housemaid')] <- "pink-collar"
bank$job[bank$job %in% c('retired', 'student', 'unemployed', 'unknown')] <- "other"
head(bank)
##   age          job marital education default balance housing loan  contact
## 1  30        other married   primary      no    1787      no   no cellular
## 2  33  pink-collar married secondary      no    4789     yes  yes cellular
## 3  35 white-collar  single  tertiary      no    1350     yes   no cellular
## 4  30 white-collar married  tertiary      no    1476     yes  yes  unknown
## 5  59  blue-collar married secondary      no       0     yes   no  unknown
## 6  35 white-collar  single  tertiary      no     747      no   no cellular
##   day month duration campaign pdays previous poutcome  y
## 1  19   oct       79        1    -1        0  unknown no
## 2  11   may      220        1   339        4  failure no
## 3  16   apr      185        1   330        1  failure no
## 4   3   jun      199        4    -1        0  unknown no
## 5   5   may      226        1    -1        0  unknown no
## 6  23   feb      141        2   176        3  failure no
bank$job <- factor(bank$job)
bank %>%
  group_by(job) %>%
  summarise(
    count = sum(y == 'yes')) 
## # A tibble: 7 x 2
##   job           count
##   <fct>         <int>
## 1 blue-collar      69
## 2 entrepreneur     15
## 3 other            93
## 4 pink-collar      52
## 5 self-employed    20
## 6 technician       83
## 7 white-collar    189

Data Transformation and Exploration

We explored the data for each feature individually and split them by the classes of target features. Then we proceeded to multivariate visualisation.

Data transformation

Poutcome

bank$poutcome <- as.character(bank$poutcome)
bank$poutcome[bank$poutcome %in% c("other")] <- "unknown"
head(bank$poutcome)
## [1] "unknown" "failure" "failure" "unknown" "unknown" "failure"

Default, Loan, Housing and Target variable y

Value for these variable from yes/no(character) to 1/0(numeric)

bank$default<-ifelse(bank$default =="yes", 1,0)
bank$housing<-ifelse(bank$housing =='yes', 1,0)
bank$loan<-ifelse(bank$loan=='yes', 1,0)
bank$y <- ifelse(bank$y == 'yes', 1, 0)
head(bank)
##   age          job marital education default balance housing loan  contact
## 1  30        other married   primary       0    1787       0    0 cellular
## 2  33  pink-collar married secondary       0    4789       1    1 cellular
## 3  35 white-collar  single  tertiary       0    1350       1    0 cellular
## 4  30 white-collar married  tertiary       0    1476       1    1  unknown
## 5  59  blue-collar married secondary       0       0       1    0  unknown
## 6  35 white-collar  single  tertiary       0     747       0    0 cellular
##   day month duration campaign pdays previous poutcome y
## 1  19   oct       79        1    -1        0  unknown 0
## 2  11   may      220        1   339        4  failure 0
## 3  16   apr      185        1   330        1  failure 0
## 4   3   jun      199        4    -1        0  unknown 0
## 5   5   may      226        1    -1        0  unknown 0
## 6  23   feb      141        2   176        3  failure 0

Drop variables

bank$day <- NULL
bank$month <- NULL
bank$contact <- NULL

Convert categorical variable into dummy

My aim is to give a matrix correlation of variables there fore I will transfer all categorical variables in to dummy ones.

# Create a copy of processed bank data
bank_origional <- bank

In the data, job, marital, education and poutcome are categorical variables. I will transfer it into dummy ones by following these code chucks:

Data Visualisation in the whole population

Scatterplot showing age and balance

Bar chart of job Vs the people they made a deposit

Heat map

bankcp<- bank # Create a copy of dummy data
corr <- cor(bank) # Create a matrix of correlation
heatmap(corr, Rowv = NA, Colv = NA)

Train and Test datasets

smp_size <- floor(0.8*nrow(bank))
set.seed(123)
train_index <- sample(seq(nrow(bank)), size = smp_size)
train <- bank[train_index, ]
test <- bank[-train_index, ]

Summary

The origional data has no missing value which is an advantage because in the reality, we have to do it ourselves.

With the variable like day, month, and contact, I deleted them as their information is not usefull for predicting models. With pdays, I add the value of “-1” as 10000 to present that it is so far in the past that it has no effect. Furthermore, a re-arragement and combination has been applied for job and poutcome variables. Also, I changed categorical variables into dummy in order to creat and visualize a heatmap matrix of correlations.