The objective of this project was to build classifiers to predict whether a person from variety walks of life would subscribe yes or no to a term deposit. The data I was using is from UCI website( Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing). This project has two phases. Phase I focuses on data preprocessing and exploration, as covered in this report. We shall present model building in Phase II. The rest of this report is organised as follow. Section 2 describes the data sets and their attributes. Section 3 covers data pre-processing. In Section 4, we explore each attribute and their inter-relationships. The last section ends with a summary.
The UCI Machine Learning Repository provides two zip files, but only bank.zip, was useful in this project. bank.csv and bank.names are the full and description data sets respectively. bank-names contains the details of attributes or variables. The full data set has 45211 observations. I will take 80% of the sample with all attributes as trainning data and the rest is to test model. In Phase II, we would build the classifiers from the full the data set and evaluate their performance using cross-validation.
The response feature is income which is given
The target feature has two classes and hence it is a binary classification problem. To reiterate, The goal is to predict whether a person subcribe a deposit.
The variable description is produced here from bank.name file:
In this project, we used the following R packages.
bank <- read.csv("~/Downloads/bank.csv", stringsAsFactors = FALSE,
header = TRUE, sep = ';')
str(bank)
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : chr "unemployed" "services" "management" "management" ...
## $ marital : chr "married" "married" "single" "married" ...
## $ education: chr "primary" "secondary" "tertiary" "tertiary" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : chr "no" "yes" "yes" "yes" ...
## $ loan : chr "no" "yes" "no" "yes" ...
## $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr "oct" "may" "apr" "jun" ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
summarizeColumns(bank) %>% knitr::kable( caption = 'Feature Summary before Data Preprocessing')
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| age | integer | 0 | 41.1700951 | 10.5762110 | 39 | 10.3782 | 19 | 87 | 0 |
| job | character | 0 | NA | 0.7856669 | NA | NA | 38 | 969 | 12 |
| marital | character | 0 | NA | 0.3813316 | NA | NA | 528 | 2797 | 3 |
| education | character | 0 | NA | 0.4899359 | NA | NA | 187 | 2306 | 4 |
| default | character | 0 | NA | 0.0168104 | NA | NA | 76 | 4445 | 2 |
| balance | integer | 0 | 1422.6578191 | 3009.6381425 | 444 | 658.2744 | -3313 | 71188 | 0 |
| housing | character | 0 | NA | 0.4339748 | NA | NA | 1962 | 2559 | 2 |
| loan | character | 0 | NA | 0.1528423 | NA | NA | 691 | 3830 | 2 |
| contact | character | 0 | NA | 0.3594338 | NA | NA | 301 | 2896 | 3 |
| day | integer | 0 | 15.9152842 | 8.2476673 | 16 | 10.3782 | 1 | 31 | 0 |
| month | character | 0 | NA | 0.6907764 | NA | NA | 20 | 1398 | 12 |
| duration | integer | 0 | 263.9612917 | 259.8566326 | 185 | 143.8122 | 4 | 3025 | 0 |
| campaign | integer | 0 | 2.7936297 | 3.1098067 | 2 | 1.4826 | 1 | 50 | 0 |
| pdays | integer | 0 | 39.7666445 | 100.1211244 | -1 | 0.0000 | -1 | 871 | 0 |
| previous | integer | 0 | 0.5425791 | 1.6935624 | 0 | 0.0000 | 0 | 25 | 0 |
| poutcome | character | 0 | NA | 0.1804910 | NA | NA | 129 | 3705 | 4 |
| y | character | 0 | NA | 0.1152400 | NA | NA | 521 | 4000 | 2 |
We removed the excessive white spaces for all character features.
bank[, sapply( bank, is.character )] <- sapply( bank[,sapply(bank, is.character )], trimws)
## remove exessive white spaces
table(bank$y)
##
## no yes
## 4000 521
It seems to me that the origional data is pretty clean and has no missing variable. Now we are going to an interesting part: Visualisation.
boxplot( data= bank, bank$age, col = "darkblue")
ggplot(bank, aes(x=age)) +
geom_histogram(aes(y=..density..), colour="grey", fill="lightblue", linetype="dashed")+
geom_density(alpha=.2, fill="#FF6666")+
geom_vline(aes(xintercept= mean(age)), color="blue", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot( data= bank, bank$duration, col = "darkblue",boxwex = 0.75)
ggplot(bank, aes(x=duration)) +
geom_histogram(aes(y=..density..), colour="grey", fill="lightblue", linetype="dashed")+
geom_density(alpha=.2)+
geom_vline(aes(xintercept= mean(duration)), color="blue",
linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(bank, aes(x=duration, fill=y)) + geom_density(alpha=.3)
NROW(na.omit(bank))
## [1] 4521
bank %>%
group_by(job) %>%
summarise(
count = sum(y == 'yes'))
## # A tibble: 12 x 2
## job count
## <chr> <int>
## 1 admin. 58
## 2 blue-collar 69
## 3 entrepreneur 15
## 4 housemaid 14
## 5 management 131
## 6 retired 54
## 7 self-employed 20
## 8 services 38
## 9 student 19
## 10 technician 83
## 11 unemployed 13
## 12 unknown 7
From here, I would like to combine similar jobs into categories. In which: - admin. and management will be white-collar; - services and housemaid will be pink-collar; - retired and student, uneployed, unknown will be other;
bank$job <- as.character(bank$job)
bank$job[bank$job %in% c("admin.","management")] <- "white-collar"
bank$job[bank$job %in% c('services','housemaid')] <- "pink-collar"
bank$job[bank$job %in% c('retired', 'student', 'unemployed', 'unknown')] <- "other"
head(bank)
## age job marital education default balance housing loan contact
## 1 30 other married primary no 1787 no no cellular
## 2 33 pink-collar married secondary no 4789 yes yes cellular
## 3 35 white-collar single tertiary no 1350 yes no cellular
## 4 30 white-collar married tertiary no 1476 yes yes unknown
## 5 59 blue-collar married secondary no 0 yes no unknown
## 6 35 white-collar single tertiary no 747 no no cellular
## day month duration campaign pdays previous poutcome y
## 1 19 oct 79 1 -1 0 unknown no
## 2 11 may 220 1 339 4 failure no
## 3 16 apr 185 1 330 1 failure no
## 4 3 jun 199 4 -1 0 unknown no
## 5 5 may 226 1 -1 0 unknown no
## 6 23 feb 141 2 176 3 failure no
bank$job <- factor(bank$job)
bank %>%
group_by(job) %>%
summarise(
count = sum(y == 'yes'))
## # A tibble: 7 x 2
## job count
## <fct> <int>
## 1 blue-collar 69
## 2 entrepreneur 15
## 3 other 93
## 4 pink-collar 52
## 5 self-employed 20
## 6 technician 83
## 7 white-collar 189
We explored the data for each feature individually and split them by the classes of target features. Then we proceeded to multivariate visualisation.
bank$poutcome <- as.character(bank$poutcome)
bank$poutcome[bank$poutcome %in% c("other")] <- "unknown"
head(bank$poutcome)
## [1] "unknown" "failure" "failure" "unknown" "unknown" "failure"
Value for these variable from yes/no(character) to 1/0(numeric)
bank$default<-ifelse(bank$default =="yes", 1,0)
bank$housing<-ifelse(bank$housing =='yes', 1,0)
bank$loan<-ifelse(bank$loan=='yes', 1,0)
bank$y <- ifelse(bank$y == 'yes', 1, 0)
head(bank)
## age job marital education default balance housing loan contact
## 1 30 other married primary 0 1787 0 0 cellular
## 2 33 pink-collar married secondary 0 4789 1 1 cellular
## 3 35 white-collar single tertiary 0 1350 1 0 cellular
## 4 30 white-collar married tertiary 0 1476 1 1 unknown
## 5 59 blue-collar married secondary 0 0 1 0 unknown
## 6 35 white-collar single tertiary 0 747 0 0 cellular
## day month duration campaign pdays previous poutcome y
## 1 19 oct 79 1 -1 0 unknown 0
## 2 11 may 220 1 339 4 failure 0
## 3 16 apr 185 1 330 1 failure 0
## 4 3 jun 199 4 -1 0 unknown 0
## 5 5 may 226 1 -1 0 unknown 0
## 6 23 feb 141 2 176 3 failure 0
bank$day <- NULL
bank$month <- NULL
bank$contact <- NULL
My aim is to give a matrix correlation of variables there fore I will transfer all categorical variables in to dummy ones.
# Create a copy of processed bank data
bank_origional <- bank
In the data, job, marital, education and poutcome are categorical variables. I will transfer it into dummy ones by following these code chucks:
bankcp<- bank # Create a copy of dummy data
corr <- cor(bank) # Create a matrix of correlation
heatmap(corr, Rowv = NA, Colv = NA)
smp_size <- floor(0.8*nrow(bank))
set.seed(123)
train_index <- sample(seq(nrow(bank)), size = smp_size)
train <- bank[train_index, ]
test <- bank[-train_index, ]
The origional data has no missing value which is an advantage because in the reality, we have to do it ourselves.
With the variable like day, month, and contact, I deleted them as their information is not usefull for predicting models. With pdays, I add the value of “-1” as 10000 to present that it is so far in the past that it has no effect. Furthermore, a re-arragement and combination has been applied for job and poutcome variables. Also, I changed categorical variables into dummy in order to creat and visualize a heatmap matrix of correlations.