Introduction

The objective of this project was to build classifiers to predict whether a person from variety walks of life would subscribe yes or no to a term deposit. The data I was using is from UCI website( Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing). This project has two phases. Phase I focuses on data preprocessing and exploration, as covered in this report. We shall present model building in Phase II. The rest of this report is organised as follow. Section 2 describes the data sets and their attributes. Section 3 covers data pre-processing. In Section 4, we explore each attribute and their inter-relationships. The last section ends with a summary.

Data Set

The UCI Machine Learning Repository provides two zip files, but only bank.zip, was useful in this project. bank.csv and bank.names are the full and description data sets respectively. bank-names contains the details of attributes or variables. The full data set has 45211 observations. I will take 80% of the sample with all attributes as trainning data and the rest is to test model. In Phase II, we would build the classifiers from the full the data set and evaluate their performance using cross-validation.

Target Feature

The response feature is income which is given

The target feature has two classes and hence it is a binary classification problem. To reiterate, The goal is to predict whether a person subcribe a deposit.

Descriptive Features

The variable description is produced here from bank.name file:

age: numeric
workclass: ype of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”, “student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)
marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
default: has credit in default? (binary: “yes”,“no”)
balance: average yearly balance, in euros (numeric)
housing: has housing loan? (binary: “yes”,“no”)
loan: has personal loan? (binary: “yes”,“no”)
contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
sex: Female, Male.day: last contact day of the month (numeric)
month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
duration: last contact duration, in seconds (numeric)
campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

Data Pre-processing

Preliminaries

In this project, we used the following R packages.

bank <- read.csv("~/Downloads/bank.csv", stringsAsFactors = FALSE,
                 header = TRUE, sep = ';')

Data Cleaning

str(bank)

## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : chr  "unemployed" "services" "management" "management" ...
##  $ marital  : chr  "married" "married" "single" "married" ...
##  $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : chr  "no" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "yes" "no" "yes" ...
##  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr  "oct" "may" "apr" "jun" ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

summarizeColumns(bank) %>% knitr::kable( caption = 'Feature Summary before Data Preprocessing')

Feature Summary before Data Preprocessing
name	type	mean	disp	median	mad	min	max	nlevs
age	integer	41.1700951	10.5762110	39	10.3782	19	87	0
job	character	NA	0.7856669	NA	NA	38	969	12
marital	character	NA	0.3813316	NA	NA	528	2797	3
education	character	NA	0.4899359	NA	NA	187	2306	4
default	character	NA	0.0168104	NA	NA	76	4445	2
balance	integer	1422.6578191	3009.6381425	444	658.2744	-3313	71188	0
housing	character	NA	0.4339748	NA	NA	1962	2559	2
loan	character	NA	0.1528423	NA	NA	691	3830	2
contact	character	NA	0.3594338	NA	NA	301	2896	3
day	integer	15.9152842	8.2476673	16	10.3782	1	31	0
month	character	NA	0.6907764	NA	NA	20	1398	12
duration	integer	263.9612917	259.8566326	185	143.8122	4	3025	0
campaign	integer	2.7936297	3.1098067	2	1.4826	1	50	0
pdays	integer	39.7666445	100.1211244	-1	0.0000	-1	871	0
previous	integer	0.5425791	1.6935624	0	0.0000	0	25	0
poutcome	character	NA	0.1804910	NA	NA	129	3705	4
y	character	NA	0.1152400	NA	NA	521	4000	2

We removed the excessive white spaces for all character features.

bank[, sapply( bank, is.character )] <- sapply( bank[,sapply(bank, is.character )], trimws) 
## remove exessive white spaces
table(bank$y)

## 
##   no  yes 
## 4000  521

It seems to me that the origional data is pretty clean and has no missing variable. Now we are going to an interesting part: Visualisation.

Quick Visualization

Box plot for age:

boxplot( data= bank, bank$age, col = "darkblue")

Distribution of age:

ggplot(bank, aes(x=age)) + 
  geom_histogram(aes(y=..density..), colour="grey", fill="lightblue", linetype="dashed")+
  geom_density(alpha=.2, fill="#FF6666")+ 
  geom_vline(aes(xintercept= mean(age)), color="blue", linetype="dashed", size=1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Box plot for duration:

boxplot( data= bank, bank$duration, col = "darkblue",boxwex = 0.75)

Histogram for duration:

ggplot(bank, aes(x=duration)) + 
  geom_histogram(aes(y=..density..), colour="grey", fill="lightblue", linetype="dashed")+
  geom_density(alpha=.2)+ 
  geom_vline(aes(xintercept= mean(duration)), color="blue",
             linetype="dashed", size=1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram for duration with deposit

ggplot(bank, aes(x=duration, fill=y)) + geom_density(alpha=.3)

Explore the people who made a deposit and job category:

NROW(na.omit(bank))

## [1] 4521

bank %>%
  group_by(job) %>%
  summarise(
    count = sum(y == 'yes'))

## # A tibble: 12 x 2
##    job           count
##    <chr>         <int>
##  1 admin.           58
##  2 blue-collar      69
##  3 entrepreneur     15
##  4 housemaid        14
##  5 management      131
##  6 retired          54
##  7 self-employed    20
##  8 services         38
##  9 student          19
## 10 technician       83
## 11 unemployed       13
## 12 unknown           7

From here, I would like to combine similar jobs into categories. In which: - admin. and management will be white-collar; - services and housemaid will be pink-collar; - retired and student, uneployed, unknown will be other;

bank$job <- as.character(bank$job)
bank$job[bank$job %in% c("admin.","management")] <- "white-collar"
bank$job[bank$job %in% c('services','housemaid')] <- "pink-collar"
bank$job[bank$job %in% c('retired', 'student', 'unemployed', 'unknown')] <- "other"
head(bank)

##   age          job marital education default balance housing loan  contact
## 1  30        other married   primary      no    1787      no   no cellular
## 2  33  pink-collar married secondary      no    4789     yes  yes cellular
## 3  35 white-collar  single  tertiary      no    1350     yes   no cellular
## 4  30 white-collar married  tertiary      no    1476     yes  yes  unknown
## 5  59  blue-collar married secondary      no       0     yes   no  unknown
## 6  35 white-collar  single  tertiary      no     747      no   no cellular
##   day month duration campaign pdays previous poutcome  y
## 1  19   oct       79        1    -1        0  unknown no
## 2  11   may      220        1   339        4  failure no
## 3  16   apr      185        1   330        1  failure no
## 4   3   jun      199        4    -1        0  unknown no
## 5   5   may      226        1    -1        0  unknown no
## 6  23   feb      141        2   176        3  failure no

bank$job <- factor(bank$job)
bank %>%
  group_by(job) %>%
  summarise(
    count = sum(y == 'yes'))

## # A tibble: 7 x 2
##   job           count
##   <fct>         <int>
## 1 blue-collar      69
## 2 entrepreneur     15
## 3 other            93
## 4 pink-collar      52
## 5 self-employed    20
## 6 technician       83
## 7 white-collar    189

Data Transformation and Exploration

We explored the data for each feature individually and split them by the classes of target features. Then we proceeded to multivariate visualisation.

Data transformation

Poutcome

bank$poutcome <- as.character(bank$poutcome)
bank$poutcome[bank$poutcome %in% c("other")] <- "unknown"
head(bank$poutcome)

## [1] "unknown" "failure" "failure" "unknown" "unknown" "failure"

Default, Loan, Housing and Target variable y

Value for these variable from yes/no(character) to 1/0(numeric)

bank$default<-ifelse(bank$default =="yes", 1,0)
bank$housing<-ifelse(bank$housing =='yes', 1,0)
bank$loan<-ifelse(bank$loan=='yes', 1,0)
bank$y <- ifelse(bank$y == 'yes', 1, 0)
head(bank)

##   age          job marital education default balance housing loan  contact
## 1  30        other married   primary       0    1787       0    0 cellular
## 2  33  pink-collar married secondary       0    4789       1    1 cellular
## 3  35 white-collar  single  tertiary       0    1350       1    0 cellular
## 4  30 white-collar married  tertiary       0    1476       1    1  unknown
## 5  59  blue-collar married secondary       0       0       1    0  unknown
## 6  35 white-collar  single  tertiary       0     747       0    0 cellular
##   day month duration campaign pdays previous poutcome y
## 1  19   oct       79        1    -1        0  unknown 0
## 2  11   may      220        1   339        4  failure 0
## 3  16   apr      185        1   330        1  failure 0
## 4   3   jun      199        4    -1        0  unknown 0
## 5   5   may      226        1    -1        0  unknown 0
## 6  23   feb      141        2   176        3  failure 0

Drop variables

bank$day <- NULL
bank$month <- NULL
bank$contact <- NULL

Convert categorical variable into dummy

My aim is to give a matrix correlation of variables there fore I will transfer all categorical variables in to dummy ones.

# Create a copy of processed bank data
bank_origional <- bank

In the data, job, marital, education and poutcome are categorical variables. I will transfer it into dummy ones by following these code chucks:

Data Visualisation in the whole population

Scatterplot showing age and balance

Bar chart of job Vs the people they made a deposit

Heat map

bankcp<- bank # Create a copy of dummy data

corr <- cor(bank) # Create a matrix of correlation

heatmap(corr, Rowv = NA, Colv = NA)

Train and Test datasets

smp_size <- floor(0.8*nrow(bank))
set.seed(123)
train_index <- sample(seq(nrow(bank)), size = smp_size)

train <- bank[train_index, ]
test <- bank[-train_index, ]

Summary

The origional data has no missing value which is an advantage because in the reality, we have to do it ourselves.

With the variable like day, month, and contact, I deleted them as their information is not usefull for predicting models. With pdays, I add the value of “-1” as 10000 to present that it is so far in the past that it has no effect. Furthermore, a re-arragement and combination has been applied for job and poutcome variables. Also, I changed categorical variables into dummy in order to creat and visualize a heatmap matrix of correlations.

Predicting Bank Marketing Data US

MATH 2319 Machine Learning Applied Project Phase I

Mai Thi Nguyen (s3639386)

8 April 2018