The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).
Data available at UCI machine learning repository here
Citation Request: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
Input variables: The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (or not) subscribed.
The classification goal is to predict if the client will subscribe a term deposit (variable y).
Number of Instances: 45211 for bank-full.csv (4521 for bank.csv)
Number of Attributes: 16 + output attribute.
Attribute information:
For more information, read [Moro et al., 2011].
** Input variables:**
** bank client data:**
related with the last contact of the current campaign:
Output variable (desired target): 17. - y - has the client subscribed a term deposit? (binary: “yes”,“no”)
pacman::p_load("lubridate", "dplyr", "magrittr")
library(rio)
library(doParallel)
library(viridis)
library(RColorBrewer)
library(tidyverse)
library(ggthemes)
library(knitr)
library(tidyverse)
library(caret)
library(plotly)
library(lime)
require(xgboost)
require(Matrix)
library(DiagrammeR)
drat:::addRepo("dmlc")
# Calculate the number of cores
no_cores <- detectCores() - 1
cl<-makeCluster(no_cores)
registerDoParallel(cl)
setwd("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience")
#load excel file with rio
Data<- rio::import("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/Anomalydetection/bank/bank-full.csv")
Data%>%head
#%>%kable()
summary(Data)
age job marital education default
Min. :18.00 Length:45211 Length:45211 Length:45211 Length:45211
1st Qu.:33.00 Class :character Class :character Class :character Class :character
Median :39.00 Mode :character Mode :character Mode :character Mode :character
Mean :40.94
3rd Qu.:48.00
Max. :95.00
balance housing loan contact day
Min. : -8019 Length:45211 Length:45211 Length:45211 Min. : 1.00
1st Qu.: 72 Class :character Class :character Class :character 1st Qu.: 8.00
Median : 448 Mode :character Mode :character Mode :character Median :16.00
Mean : 1362 Mean :15.81
3rd Qu.: 1428 3rd Qu.:21.00
Max. :102127 Max. :31.00
month duration campaign pdays previous
Length:45211 Min. : 0.0 Min. : 1.000 Min. : -1.0 Min. : 0.0000
Class :character 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000
Mode :character Median : 180.0 Median : 2.000 Median : -1.0 Median : 0.0000
Mean : 258.2 Mean : 2.764 Mean : 40.2 Mean : 0.5803
3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
Max. :4918.0 Max. :63.000 Max. :871.0 Max. :275.0000
poutcome y
Length:45211 Length:45211
Class :character Class :character
Mode :character Mode :character
There are no missing observations in the data.
#==================================================================
#check the number of missing rows
#==================================================================
colSums(is.na.data.frame(Data))
age job marital education default balance housing loan contact day
0 0 0 0 0 0 0 0 0 0
month duration campaign pdays previous poutcome y
0 0 0 0 0 0 0
Exploratory Data Analysis
#==================================================================
# descriptive/summary statistics
#==================================================================
Hmisc::describe.data.frame(Data)
Data
17 Variables 45211 Observations
--------------------------------------------------------------------------------------------------------
age
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 77 0.999 40.94 11.87 27 29 33 39 48
.90 .95
56 59
lowest : 18 19 20 21 22, highest: 90 92 93 94 95
--------------------------------------------------------------------------------------------------------
job
n missing distinct
45211 0 12
Value admin. blue-collar entrepreneur housemaid management retired
Frequency 5171 9732 1487 1240 9458 2264
Proportion 0.114 0.215 0.033 0.027 0.209 0.050
Value self-employed services student technician unemployed unknown
Frequency 1579 4154 938 7597 1303 288
Proportion 0.035 0.092 0.021 0.168 0.029 0.006
--------------------------------------------------------------------------------------------------------
marital
n missing distinct
45211 0 3
Value divorced married single
Frequency 5207 27214 12790
Proportion 0.115 0.602 0.283
--------------------------------------------------------------------------------------------------------
education
n missing distinct
45211 0 4
Value primary secondary tertiary unknown
Frequency 6851 23202 13301 1857
Proportion 0.152 0.513 0.294 0.041
--------------------------------------------------------------------------------------------------------
default
n missing distinct
45211 0 2
Value no yes
Frequency 44396 815
Proportion 0.982 0.018
--------------------------------------------------------------------------------------------------------
balance
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 7168 1 1362 2054 -172 0 72 448 1428
.90 .95
3574 5768
lowest : -8019 -6847 -4057 -3372 -3313, highest: 66721 71188 81204 98417 102127
--------------------------------------------------------------------------------------------------------
housing
n missing distinct
45211 0 2
Value no yes
Frequency 20081 25130
Proportion 0.444 0.556
--------------------------------------------------------------------------------------------------------
loan
n missing distinct
45211 0 2
Value no yes
Frequency 37967 7244
Proportion 0.84 0.16
--------------------------------------------------------------------------------------------------------
contact
n missing distinct
45211 0 3
Value cellular telephone unknown
Frequency 29285 2906 13020
Proportion 0.648 0.064 0.288
--------------------------------------------------------------------------------------------------------
day
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 31 0.999 15.81 9.576 3 5 8 16 21
.90 .95
28 29
lowest : 1 2 3 4 5, highest: 27 28 29 30 31
--------------------------------------------------------------------------------------------------------
month
n missing distinct
45211 0 12
Value apr aug dec feb jan jul jun mar may nov oct sep
Frequency 2932 6247 214 2649 1403 6895 5341 477 13766 3970 738 579
Proportion 0.065 0.138 0.005 0.059 0.031 0.153 0.118 0.011 0.304 0.088 0.016 0.013
--------------------------------------------------------------------------------------------------------
duration
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 1573 1 258.2 235.4 35 58 103 180 319
.90 .95
548 751
lowest : 0 1 2 3 4, highest: 3366 3422 3785 3881 4918
--------------------------------------------------------------------------------------------------------
campaign
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 48 0.918 2.764 2.383 1 1 1 2 3
.90 .95
5 8
lowest : 1 2 3 4 5, highest: 50 51 55 58 63
--------------------------------------------------------------------------------------------------------
pdays
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 559 0.454 40.2 71.61 -1 -1 -1 -1 -1
.90 .95
185 317
lowest : -1 1 2 3 4, highest: 838 842 850 854 871
--------------------------------------------------------------------------------------------------------
previous
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
45211 0 41 0.454 0.5803 1.044 0 0 0 0 0
.90 .95
2 3
lowest : 0 1 2 3 4, highest: 41 51 55 58 275
--------------------------------------------------------------------------------------------------------
poutcome
n missing distinct
45211 0 4
Value failure other success unknown
Frequency 4901 1840 1511 36959
Proportion 0.108 0.041 0.033 0.817
--------------------------------------------------------------------------------------------------------
y
n missing distinct
45211 0 2
Value no yes
Frequency 39922 5289
Proportion 0.883 0.117
--------------------------------------------------------------------------------------------------------
#describe(Data)
#==================================================================
# Histograms
#==================================================================
theme_set(theme_economist_white())
#ggplot(Data) + geom_boxplot(aes(x =age,y=duration,color=y))
ggplot(Data, aes(x ="",y=age, fill=y))+ geom_boxplot()+labs(x="age",y="")
#ggplotly(p)
ggplot(Data, aes(x =duration, fill=y))+ geom_histogram(bins = 30)
#ggplotly()
ggplot(Data, aes(x =age, fill=y))+ geom_histogram(bins = 30)
# ggplotly()
ggplot(Data, aes(x =day, fill=y))+ geom_histogram(bins = 30)
#ggplotly()
ggplot(Data, aes(x =balance, fill=y))+ geom_histogram(bins = 30)
#ggplotly()
ggplot(Data, aes(x =age, fill=y))+ geom_histogram(bins = 30)
# ggplotly()
# geom_density(alpha=1/3,color="red") + scale_fill_hue()
ggplot(Data, aes(x=age, fill=y)) + geom_density(alpha=1/3) + scale_fill_hue()
ggplot(Data, aes(x=marital,previous)) +geom_violin()
ggplot(Data, aes(x=education,previous)) +geom_violin()
#ggplot(Data, aes(x="",y =loan, fill=y)) + geom_histogram()
ggplot(Data, aes(x ="",y=age, fill=y))+ geom_violin(adjust = .5,draw_quantiles = c(0.25, 0.5, 0.75))+labs(x="age",y="")
ggplotly()
NA
NA
Convert character variables to factor variables.This is neccessary for the caret package to train the models we are interested in later.
theme_set(theme_bw())
Data<-Data %>% mutate_if(is.character, as.factor)
str(Data)
'data.frame': 45211 obs. of 17 variables:
$ age : int 58 44 33 47 33 35 28 42 58 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
$ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
$ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
$ day : int 5 5 5 5 5 5 5 5 5 5 ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
$ duration : int 261 151 76 92 198 139 217 380 50 55 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
ggplot(Data, aes(x =y))+ geom_histogram(stat="count")+labs(x="Term Deposit")
#==================================================================
#Converting outcome variable to numeric
#==================================================================
Data$y<-ifelse(Data$y=='no',0,1)
str(Data)
'data.frame': 45211 obs. of 17 variables:
$ age : int 58 44 33 47 33 35 28 42 58 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
$ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
$ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
$ day : int 5 5 5 5 5 5 5 5 5 5 ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
$ duration : int 261 151 76 92 198 139 217 380 50 55 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
$ y : num 0 0 0 0 0 0 0 0 0 0 ...
glimpse(Data)
Observations: 45,211
Variables: 17
$ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, 51, 45, 57, 60, 33, 2...
$ job <fctr> management, technician, entrepreneur, blue-collar, unknown, management, manageme...
$ marital <fctr> married, single, married, married, single, married, single, divorced, married, s...
$ education <fctr> tertiary, secondary, secondary, unknown, unknown, tertiary, tertiary, tertiary, ...
$ default <fctr> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no, no, no, no, no, no,...
$ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71, 162, 229, 13, 52, 6...
$ housing <fctr> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
$ loan <fctr> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no, no, no, no, no, no...
$ contact <fctr> unknown, unknown, unknown, unknown, unknown, unknown, unknown, unknown, unknown,...
$ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
$ month <fctr> may, may, may, may, may, may, may, may, may, may, may, may, may, may, may, may, ...
$ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517, 71, 174, 353, 98, 38...
$ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -...
$ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ poutcome <fctr> unknown, unknown, unknown, unknown, unknown, unknown, unknown, unknown, unknown,...
$ y <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Convert Categorical varialbes to dummy variables using either Model.matrix or sparse.model,matrix. The \(-1\) in the formula removes the first column which is all ones.
predictors<-setdiff(names(Data),Data$y)
#names(Data)
predictors<-names(Data[,-17])
#paste0(predictors,sep="",collapse = "+ ")
#paste("~",paste0(predictors,sep="",collapse = "+ "))
#as.formula(paste("y~",paste0(predictors,sep="",collapse = "+ ")))
d1<-model.matrix(y~.-1, Data)
d1b<-Matrix::sparse.model.matrix(y~.-1, Data)
head(d1)
age jobadmin. jobblue-collar jobentrepreneur jobhousemaid jobmanagement jobretired jobself-employed
1 58 0 0 0 0 1 0 0
2 44 0 0 0 0 0 0 0
3 33 0 0 1 0 0 0 0
4 47 0 1 0 0 0 0 0
5 33 0 0 0 0 0 0 0
6 35 0 0 0 0 1 0 0
jobservices jobstudent jobtechnician jobunemployed jobunknown maritalmarried maritalsingle
1 0 0 0 0 0 1 0
2 0 0 1 0 0 0 1
3 0 0 0 0 0 1 0
4 0 0 0 0 0 1 0
5 0 0 0 0 1 0 1
6 0 0 0 0 0 1 0
educationsecondary educationtertiary educationunknown defaultyes balance housingyes loanyes
1 0 1 0 0 2143 1 0
2 1 0 0 0 29 1 0
3 1 0 0 0 2 1 1
4 0 0 1 0 1506 1 0
5 0 0 1 0 1 0 0
6 0 1 0 0 231 1 0
contacttelephone contactunknown day monthaug monthdec monthfeb monthjan monthjul monthjun monthmar
1 0 1 5 0 0 0 0 0 0 0
2 0 1 5 0 0 0 0 0 0 0
3 0 1 5 0 0 0 0 0 0 0
4 0 1 5 0 0 0 0 0 0 0
5 0 1 5 0 0 0 0 0 0 0
6 0 1 5 0 0 0 0 0 0 0
monthmay monthnov monthoct monthsep duration campaign pdays previous poutcomeother poutcomesuccess
1 1 0 0 0 261 1 -1 0 0 0
2 1 0 0 0 151 1 -1 0 0 0
3 1 0 0 0 76 1 -1 0 0 0
4 1 0 0 0 92 1 -1 0 0 0
5 1 0 0 0 198 1 -1 0 0 0
6 1 0 0 0 139 1 -1 0 0 0
poutcomeunknown
1 1
2 1
3 1
4 1
5 1
6 1
head(d1b)
6 x 43 sparse Matrix of class "dgCMatrix"
[[ suppressing 43 column names ‘age’, ‘jobadmin.’, ‘jobblue-collar’ ... ]]
1 58 . . . . 1 . . . . . . . 1 . . 1 . . 2143 1 . . 1 5 . . . . . . . 1 . . . 261 1 -1 . . . 1
2 44 . . . . . . . . . 1 . . . 1 1 . . . 29 1 . . 1 5 . . . . . . . 1 . . . 151 1 -1 . . . 1
3 33 . . 1 . . . . . . . . . 1 . 1 . . . 2 1 1 . 1 5 . . . . . . . 1 . . . 76 1 -1 . . . 1
4 47 . 1 . . . . . . . . . . 1 . . . 1 . 1506 1 . . 1 5 . . . . . . . 1 . . . 92 1 -1 . . . 1
5 33 . . . . . . . . . . . 1 . 1 . . 1 . 1 . . . 1 5 . . . . . . . 1 . . . 198 1 -1 . . . 1
6 35 . . . . 1 . . . . . . . 1 . . 1 . . 231 1 . . 1 5 . . . . . . . 1 . . . 139 1 -1 . . . 1
names(d1b)
NULL
#index<-createDataPartition(transformed$y,p=0.70, list=FALSE)
#trainSet<-transformed[index,]
#testSet<-transformed[-index,]
Next step, we will transform the categorical data to dummy variables. This is the one-hot encoding step.
The purpose is to transform each value of each categorical feature in a binary feature {0, 1}. The dummy conversion results in 42 variables.
#==================================================================
#convert categorical variables to numeric variables
#==================================================================
dmy <- dummyVars(" ~ .", data = Data,fullRank = T)
transformed <- data.frame(predict(dmy, newdata =Data))
closing unused connection 23 (<-localhost:11246)closing unused connection 22 (<-localhost:11246)closing unused connection 21 (<-localhost:11246)closing unused connection 20 (<-localhost:11246)closing unused connection 19 (<-localhost:11246)closing unused connection 18 (<-localhost:11246)closing unused connection 17 (<-localhost:11246)
#save(transformed,file="transformed.RData")
load("transformed.RData")
#Checking the structure of transformed train file
str(transformed)
'data.frame': 45211 obs. of 43 variables:
$ age : num 58 44 33 47 33 35 28 42 58 43 ...
$ jobblue.collar : num 0 0 0 1 0 0 0 0 0 0 ...
$ jobentrepreneur : num 0 0 1 0 0 0 0 1 0 0 ...
$ jobhousemaid : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobmanagement : num 1 0 0 0 0 1 1 0 0 0 ...
$ jobretired : num 0 0 0 0 0 0 0 0 1 0 ...
$ jobself.employed : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobservices : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobstudent : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobtechnician : num 0 1 0 0 0 0 0 0 0 1 ...
$ jobunemployed : num 0 0 0 0 0 0 0 0 0 0 ...
$ jobunknown : num 0 0 0 0 1 0 0 0 0 0 ...
$ maritalmarried : num 1 0 1 1 0 1 0 0 1 0 ...
$ maritalsingle : num 0 1 0 0 1 0 1 0 0 1 ...
$ educationsecondary: num 0 1 1 0 0 0 0 0 0 1 ...
$ educationtertiary : num 1 0 0 0 0 1 1 1 0 0 ...
$ educationunknown : num 0 0 0 1 1 0 0 0 0 0 ...
$ defaultyes : num 0 0 0 0 0 0 0 1 0 0 ...
$ balance : num 2143 29 2 1506 1 ...
$ housingyes : num 1 1 1 1 0 1 1 1 1 1 ...
$ loanyes : num 0 0 1 0 0 0 1 0 0 0 ...
$ contacttelephone : num 0 0 0 0 0 0 0 0 0 0 ...
$ contactunknown : num 1 1 1 1 1 1 1 1 1 1 ...
$ day : num 5 5 5 5 5 5 5 5 5 5 ...
$ monthaug : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthdec : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthfeb : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthjan : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthjul : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthjun : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthmar : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthmay : num 1 1 1 1 1 1 1 1 1 1 ...
$ monthnov : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthoct : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthsep : num 0 0 0 0 0 0 0 0 0 0 ...
$ duration : num 261 151 76 92 198 139 217 380 50 55 ...
$ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ previous : num 0 0 0 0 0 0 0 0 0 0 ...
$ poutcomeother : num 0 0 0 0 0 0 0 0 0 0 ...
$ poutcomesuccess : num 0 0 0 0 0 0 0 0 0 0 ...
$ poutcomeunknown : num 1 1 1 1 1 1 1 1 1 1 ...
$ y : num 0 0 0 0 0 0 0 0 0 0 ...
#==================================================================
#Converting the dependent variable back to categorical
#==================================================================
transformed$y<-as.factor(transformed$y)
#==================================================================
#Spliting training set into two parts based on outcome: 70% and 30%
#==================================================================
index<-createDataPartition(transformed$y,p=0.70, list=FALSE)
trainSet<-transformed[index,]
testSet<-transformed[-index,]
outcomeName<-'y'
predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
predictors
[1] "age" "jobblue.collar" "jobentrepreneur" "jobhousemaid"
[5] "jobmanagement" "jobretired" "jobself.employed" "jobservices"
[9] "jobstudent" "jobtechnician" "jobunemployed" "jobunknown"
[13] "maritalmarried" "maritalsingle" "educationsecondary" "educationtertiary"
[17] "educationunknown" "defaultyes" "balance" "housingyes"
[21] "loanyes" "contacttelephone" "contactunknown" "day"
[25] "monthaug" "monthdec" "monthfeb" "monthjan"
[29] "monthjul" "monthjun" "monthmar" "monthmay"
[33] "monthnov" "monthoct" "monthsep" "duration"
[37] "campaign" "pdays" "previous" "poutcomeother"
[41] "poutcomesuccess" "poutcomeunknown"
dim(trainSet)
[1] 31649 43
dim(testSet)
[1] 13562 43
The training data has 31649 rows and 43 columns whereas the test data has 13562 rows and 43 columns. The number of features 43, is soo high that feature engineering is neccessary to reduce overfitting and complexity of the models we are going to build.
Recursive Feature Selection
#==================================================================
#Feature selection using rfe in caret(recursive feature extraction)
#predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
#Alternatively
#predictors<-setdiff(names(trainSet),outcomeName)
#==================================================================
library(randomForest)
#
# control <- rfeControl(functions = rfFuncs,
# method = "repeatedcv",
# repeats = 3,
# verbose = FALSE,
# allowParallel = TRUE)
# outcomeName<-'y'
#
# predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
#
# feature_select <- rfe(trainSet[,predictors], trainSet[,outcomeName],
# rfeControl = control)
#
#
#
#
#
# table(trainSet$y)
#
#
# feature_select
#
#
# names(feature_select)
#
#
#
# predictors(feature_select)
#
# summary(feature_select)
#
# feature_select[ "bestSubset"]
#save(feature_select,file="feature_select.RData")
load("feature_select.RData")
#The top 5 variables (out of 42):
#print("The top 5 variables (out of 42)\n")
cat("The top 5 variables (out of 42)\n")
The top 5 variables (out of 42)
cat("duration, poutcomesuccess, monthmar, contactunknown, housingyes\n")
duration, poutcomesuccess, monthmar, contactunknown, housingyes
predictors(feature_select)
[1] "duration" "poutcomesuccess" "monthmar" "contactunknown" "housingyes"
[6] "age" "monthoct" "day" "monthsep" "pdays"
[11] "monthjul" "monthaug" "monthmay" "monthjun" "monthdec"
[16] "campaign"
#===================================================================================
# plot variable selection
#===================================================================================
trellis.par.set(caretTheme())
plot(feature_select, type = c( "o","g"))
About 8(14 variables ) features provides the optimal accuracy for training.
The top 5 variables provides an accuracy of about 90% for the data. The remaining 36 variables add less than 0.1 . This is the advantage of feature engineering. It helps to reduce complexity in the model, reduce overfitting and also computationaly time.
#duration, poutcomesuccess, monthmar, contactunknown, housingyes
Data[,c("duration","poutcome","month","contact","housing" )]%>%head()
names(Data)
[1] "age" "job" "marital" "education" "default" "balance" "housing" "loan"
[9] "contact" "day" "month" "duration" "campaign" "pdays" "previous" "poutcome"
[17] "y"
#Data$y<-if_else(Data$y=="y",1,0)
Data$y<-ifelse(Data$y=='no',0,1)
index=sample(x=1:2, size=dim(Data)[1], replace = TRUE, prob = c(0.3,0.7))
#datanew<-Matrix::sparse.model.matrix(~.-1,Data[,c("duration","poutcome","month","contact","housing" )])
train=Data[index==2,]
test=Data[index==1,]
train_features<-Matrix::sparse.model.matrix(~.-1,train[,c("duration","poutcome","month","contact","housing" )])
#train_label<-Matrix::sparse.model.matrix(~-1+train[,c("y" )])
train_label<-train[,"y"]
test_features<-Matrix::sparse.model.matrix(~.-1,test[,c("duration","poutcome","month","contact","housing" )])
#train_label<-if_else(train_label=="y",1,0)
test_label<-test[,"y"]
#dim(train)
#dim(test)
#train_features<-model.matrix(~.-1,train[,c("duration","poutcome","month","contact","housing" )])
#train_label<-data.frame(if_else(train_label=="y",1,0))
dim(train_features)
[1] 31690 19
dim(train_label)
NULL
train_features%>%dim()
[1] 31690 19
train_label%>%length()
[1] 31690
head(Data)
#class(train)
The test data has 13542 rows whereas the training data has 31669 rows.
#train_features<-matrix(data.frame(train[,c("duration","poutcome","month","contact","housing" )]))
#train_label<-matrix(train_label)
class(train_label)
[1] "numeric"
class(train_label)
[1] "numeric"
#train_features<-lapply(train_features, as.numeric)
#train_label<-lapply(train_label, as.numeric)
#train_features<-as.numeric(train_features)
#train_label<-lapply(train_label,as.numeric)
dim(train_features)
[1] 31690 19
length(train_label)
[1] 31690
# Train the xgboost model using the "xgboost" function
dtrain = xgb.DMatrix(data =train_features,label=train_label)
xgModel = xgboost(data = dtrain, nround = 5, objective = "binary:logistic")
[1] train-error:0.000000
[2] train-error:0.000000
[3] train-error:0.000000
[4] train-error:0.000000
[5] train-error:0.000000
cv = xgb.cv(data = dtrain, nround = 10, nfold = 5, objective = "binary:logistic")
[1] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[2] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[3] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[4] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[5] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[6] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[7] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[8] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[9] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[10] train-error:0.000000+0.000000 test-error:0.000000+0.000000
# save and load model
xgb.save(xgModel, 'xgModel')
[1] TRUE
bst <- xgb.load('xgModel')
# Make the predictions on the test data
preds = predict(xgModel, test_features)
stopImplicitCluster()