The data relate to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y).

The data are available at the UCI Machine Learning Repository.
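A minimal download sketch (the zip URL reflects my assumption of the repository's layout at the time of writing; adjust if the dataset page has moved):

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"
download.file(url, destfile = "bank.zip", mode = "wb")
unzip("bank.zip")                              # extracts bank-full.csv and bank.csv
bank <- read.csv("bank-full.csv", sep = ";")   # the csv files are semicolon-delimited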

Citation Request: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

The data relate to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed.

The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

  1. Number of Instances: 45211 for bank-full.csv (4521 for bank.csv)

  2. Number of Attributes: 16 + output attribute.

  3. Attribute information:

**Input variables:**

**Bank client data:**

**Summary:** The most important predictors of term deposit subscription are a successful outcome of the previous marketing campaign, the last contact duration, whether the person holds a housing loan, and whether the last contact month was March. In general, a longer last contact duration, a successful previous campaign outcome, and contact in March all raise the odds that a person will subscribe to a term deposit. On the other hand, people with an unknown contact type or with a housing loan are less likely to subscribe.

    • age (numeric)
    • job : type of job (categorical:“admin.”,“unknown”,“unemployed”,“management”,“housemaid”, “entrepreneur”,“student”,“blue-collar”,“self-employed”,“retired”,“technician”,“services”)
    • marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
    • education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
    • default: has credit in default? (binary: “yes”,“no”)
    • balance: average yearly balance, in euros (numeric)
    • housing: has housing loan? (binary: “yes”,“no”)
    • loan: has personal loan? (binary: “yes”,“no”)

related to the last contact of the current campaign:

    • contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
    • day: last contact day of the month (numeric)
    • month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
    • duration: last contact duration, in seconds (numeric)

other attributes:
    • campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
    • pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
    • previous: number of contacts performed before this campaign and for this client (numeric)
    • poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

Output variable (desired target): y - has the client subscribed to a term deposit? (binary: “yes”,“no”)

  4. Missing Attribute Values: None

# Load packages
pacman::p_load("lubridate", "dplyr", "magrittr")

library(rio)
library(doParallel)
library(viridis)
library(RColorBrewer)
library(tidyverse)
library(ggthemes)
library(knitr)
library(caret)
library(caretEnsemble)
library(plotly)
library(lime)
library(plotROC)
library(pROC)
library(xgboost)
library(Matrix)
# Calculate the number of cores
no_cores <- detectCores() - 1

cl<-makeCluster(no_cores)
registerDoParallel(cl)
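# When all model training is finished, release the worker processes with:
# stopCluster(cl)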


setwd("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience")
#load excel file with rio
Data<- rio::import("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/Anomalydetection/bank/bank-full.csv")

Data%>%head
#%>%kable()
summary(Data)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

There are no missing observations in the data.

#==================================================================
#check the number of missing rows
#==================================================================

colSums(is.na.data.frame(Data))
##       age       job   marital education   default   balance   housing 
##         0         0         0         0         0         0         0 
##      loan   contact       day     month  duration  campaign     pdays 
##         0         0         0         0         0         0         0 
##  previous  poutcome         y 
##         0         0         0
#==================================================================
# descriptive/summary statistics
#==================================================================



Hmisc::describe.data.frame(Data)   
## Data 
## 
##  17  Variables      45211  Observations
## ---------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45211        0       77    0.999    40.94    11.87       27       29 
##      .25      .50      .75      .90      .95 
##       33       39       48       56       59 
## 
## lowest : 18 19 20 21 22, highest: 90 92 93 94 95
## ---------------------------------------------------------------------------
## job 
##        n  missing distinct 
##    45211        0       12 
## 
## admin. (5171, 0.114), blue-collar (9732, 0.215), entrepreneur (1487,
## 0.033), housemaid (1240, 0.027), management (9458, 0.209), retired (2264,
## 0.050), self-employed (1579, 0.035), services (4154, 0.092), student (938,
## 0.021), technician (7597, 0.168), unemployed (1303, 0.029), unknown (288,
## 0.006)
## ---------------------------------------------------------------------------
## marital 
##        n  missing distinct 
##    45211        0        3 
##                                      
## Value      divorced  married   single
## Frequency      5207    27214    12790
## Proportion    0.115    0.602    0.283
## ---------------------------------------------------------------------------
## education 
##        n  missing distinct 
##    45211        0        4 
##                                                   
## Value        primary secondary  tertiary   unknown
## Frequency       6851     23202     13301      1857
## Proportion     0.152     0.513     0.294     0.041
## ---------------------------------------------------------------------------
## default 
##        n  missing distinct 
##    45211        0        2 
##                       
## Value         no   yes
## Frequency  44396   815
## Proportion 0.982 0.018
## ---------------------------------------------------------------------------
## balance 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45211        0     7168        1     1362     2054     -172        0 
##      .25      .50      .75      .90      .95 
##       72      448     1428     3574     5768 
## 
## lowest :  -8019  -6847  -4057  -3372  -3313, highest:  66721  71188  81204  98417 102127
## ---------------------------------------------------------------------------
## housing 
##        n  missing distinct 
##    45211        0        2 
##                       
## Value         no   yes
## Frequency  20081 25130
## Proportion 0.444 0.556
## ---------------------------------------------------------------------------
## loan 
##        n  missing distinct 
##    45211        0        2 
##                       
## Value         no   yes
## Frequency  37967  7244
## Proportion  0.84  0.16
## ---------------------------------------------------------------------------
## contact 
##        n  missing distinct 
##    45211        0        3 
##                                         
## Value       cellular telephone   unknown
## Frequency      29285      2906     13020
## Proportion     0.648     0.064     0.288
## ---------------------------------------------------------------------------
## day 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45211        0       31    0.999    15.81    9.576        3        5 
##      .25      .50      .75      .90      .95 
##        8       16       21       28       29 
## 
## lowest :  1  2  3  4  5, highest: 27 28 29 30 31
## ---------------------------------------------------------------------------
## month 
##        n  missing distinct 
##    45211        0       12 
##                                                                       
## Value        apr   aug   dec   feb   jan   jul   jun   mar   may   nov
## Frequency   2932  6247   214  2649  1403  6895  5341   477 13766  3970
## Proportion 0.065 0.138 0.005 0.059 0.031 0.153 0.118 0.011 0.304 0.088
##                       
## Value        oct   sep
## Frequency    738   579
## Proportion 0.016 0.013
## ---------------------------------------------------------------------------
## duration 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45211        0     1573        1    258.2    235.4       35       58 
##      .25      .50      .75      .90      .95 
##      103      180      319      548      751 
## 
## lowest :    0    1    2    3    4, highest: 3366 3422 3785 3881 4918
## ---------------------------------------------------------------------------
## campaign 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45211        0       48    0.918    2.764    2.383        1        1 
##      .25      .50      .75      .90      .95 
##        1        2        3        5        8 
## 
## lowest :  1  2  3  4  5, highest: 50 51 55 58 63
## ---------------------------------------------------------------------------
## pdays 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45211        0      559    0.454     40.2    71.61       -1       -1 
##      .25      .50      .75      .90      .95 
##       -1       -1       -1      185      317 
## 
## lowest :  -1   1   2   3   4, highest: 838 842 850 854 871
## ---------------------------------------------------------------------------
## previous 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    45211        0       41    0.454   0.5803    1.044        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        2        3 
## 
## lowest :   0   1   2   3   4, highest:  41  51  55  58 275
## ---------------------------------------------------------------------------
## poutcome 
##        n  missing distinct 
##    45211        0        4 
##                                           
## Value      failure   other success unknown
## Frequency     4901    1840    1511   36959
## Proportion   0.108   0.041   0.033   0.817
## ---------------------------------------------------------------------------
## y 
##        n  missing distinct 
##    45211        0        2 
##                       
## Value         no   yes
## Frequency  39922  5289
## Proportion 0.883 0.117
## ---------------------------------------------------------------------------
#describe(Data)
#==================================================================
# Histograms
#==================================================================
theme_set(theme_economist_white())

#ggplot(Data) + geom_boxplot(aes(x =age,y=duration,color=y))

ggplot(Data, aes(x ="",y=age, fill=y))+ geom_boxplot()+labs(x="age",y="")

#ggplotly(p)

ggplot(Data, aes(x =duration, fill=y))+ geom_histogram(bins = 30)

#ggplotly()

ggplot(Data, aes(x =age, fill=y))+ geom_histogram(bins = 30)

# ggplotly()
 
ggplot(Data, aes(x =day, fill=y))+ geom_histogram(bins = 30)

#ggplotly()

ggplot(Data, aes(x =balance, fill=y))+ geom_histogram(bins = 30)

#ggplotly()

ggplot(Data, aes(x =age, fill=y))+ geom_histogram(bins = 30)

# ggplotly() 
 # geom_density(alpha=1/3,color="red") + scale_fill_hue()

ggplot(Data, aes(x=age, fill=y)) + geom_density(alpha=1/3) + scale_fill_hue()

ggplot(Data, aes(x=marital,previous)) +geom_violin()

ggplot(Data, aes(x=education,previous)) +geom_violin()

#ggplot(Data, aes(x="",y =loan, fill=y)) + geom_histogram()

ggplot(Data, aes(x ="",y=age, fill=y))+ geom_violin(adjust = .5,draw_quantiles = c(0.25, 0.5, 0.75))+labs(x="age",y="")

ggplotly()

Convert character variables to factor variables. This is necessary for the caret package to train the models we are interested in later.

theme_set(theme_bw())

Data<-Data %>% mutate_if(is.character, as.factor) 

str(Data)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
ggplot(Data, aes(x = y)) + geom_bar() + labs(x="Term Deposit")

#==================================================================

#Converting outcome variable to numeric
#==================================================================

Data$y<-ifelse(Data$y=='no',0,1)



str(Data)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : num  0 0 0 0 0 0 0 0 0 0 ...
glimpse(Data)
## Observations: 45,211
## Variables: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, ...
## $ job       <fctr> management, technician, entrepreneur, blue-collar, ...
## $ marital   <fctr> married, single, married, married, single, married,...
## $ education <fctr> tertiary, secondary, secondary, unknown, unknown, t...
## $ default   <fctr> no, no, no, no, no, no, no, yes, no, no, no, no, no...
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 39...
## $ housing   <fctr> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, ye...
## $ loan      <fctr> no, no, yes, no, no, no, yes, no, no, no, no, no, n...
## $ contact   <fctr> unknown, unknown, unknown, unknown, unknown, unknow...
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ month     <fctr> may, may, may, may, may, may, may, may, may, may, m...
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 1...
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ poutcome  <fctr> unknown, unknown, unknown, unknown, unknown, unknow...
## $ y         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

Convert categorical variables to dummy variables using either model.matrix or Matrix::sparse.model.matrix. Adding \(-1\) to the formula would drop the intercept column, which is all ones; the formulas below keep the intercept.
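A toy illustration of that convention (df here is a hypothetical one-factor data frame):

df <- data.frame(x = factor(c("a", "b", "c")))
model.matrix(~ x, df)       # intercept column of ones plus k-1 dummies
model.matrix(~ x - 1, df)   # -1 drops the intercept: one dummy per level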

predictors <- setdiff(names(Data), "y")

#names(Data)
predictors <- names(Data[, -17])   # equivalently, drop column 17 (the outcome)

#paste0(predictors,sep="",collapse = "+ ")

#paste("~",paste0(predictors,sep="",collapse = "+ "))

as.formula(paste("y~",paste0(predictors,sep="",collapse = "+ ")))
## y ~ age + job + marital + education + default + balance + housing + 
##     loan + contact + day + month + duration + campaign + pdays + 
##     previous + poutcome
d1<-model.matrix(as.formula(paste("y~",paste0(predictors,sep="",collapse = "+ "))), Data)

d1b<-Matrix::sparse.model.matrix(as.formula(paste("~",paste0(predictors,sep="",collapse = "+ "))), Data)



#d1<-model.matrix(~age+job+marital+education + default + balance + housing + 
#    loan + contact + day + month + duration + campaign + pdays + 
#    previous + poutcome,Data)

#d1b<-Matrix::sparse.model.matrix(~age+job+marital+education + default + balance + housing + 
#    loan + contact + day + month + duration + campaign + pdays + 
#    previous + poutcome,Data)

head(d1)
##   (Intercept) age jobblue-collar jobentrepreneur jobhousemaid
## 1           1  58              0               0            0
## 2           1  44              0               0            0
## 3           1  33              0               1            0
## 4           1  47              1               0            0
## 5           1  33              0               0            0
## 6           1  35              0               0            0
##   jobmanagement jobretired jobself-employed jobservices jobstudent
## 1             1          0                0           0          0
## 2             0          0                0           0          0
## 3             0          0                0           0          0
## 4             0          0                0           0          0
## 5             0          0                0           0          0
## 6             1          0                0           0          0
##   jobtechnician jobunemployed jobunknown maritalmarried maritalsingle
## 1             0             0          0              1             0
## 2             1             0          0              0             1
## 3             0             0          0              1             0
## 4             0             0          0              1             0
## 5             0             0          1              0             1
## 6             0             0          0              1             0
##   educationsecondary educationtertiary educationunknown defaultyes balance
## 1                  0                 1                0          0    2143
## 2                  1                 0                0          0      29
## 3                  1                 0                0          0       2
## 4                  0                 0                1          0    1506
## 5                  0                 0                1          0       1
## 6                  0                 1                0          0     231
##   housingyes loanyes contacttelephone contactunknown day monthaug monthdec
## 1          1       0                0              1   5        0        0
## 2          1       0                0              1   5        0        0
## 3          1       1                0              1   5        0        0
## 4          1       0                0              1   5        0        0
## 5          0       0                0              1   5        0        0
## 6          1       0                0              1   5        0        0
##   monthfeb monthjan monthjul monthjun monthmar monthmay monthnov monthoct
## 1        0        0        0        0        0        1        0        0
## 2        0        0        0        0        0        1        0        0
## 3        0        0        0        0        0        1        0        0
## 4        0        0        0        0        0        1        0        0
## 5        0        0        0        0        0        1        0        0
## 6        0        0        0        0        0        1        0        0
##   monthsep duration campaign pdays previous poutcomeother poutcomesuccess
## 1        0      261        1    -1        0             0               0
## 2        0      151        1    -1        0             0               0
## 3        0       76        1    -1        0             0               0
## 4        0       92        1    -1        0             0               0
## 5        0      198        1    -1        0             0               0
## 6        0      139        1    -1        0             0               0
##   poutcomeunknown
## 1               1
## 2               1
## 3               1
## 4               1
## 5               1
## 6               1
head(d1b)
## 6 x 43 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 43 column names '(Intercept)', 'age', 'jobblue-collar' ... ]]
##                                                                          
## 1 1 58 . . . 1 . . . . . . . 1 . . 1 . . 2143 1 . . 1 5 . . . . . . . 1 .
## 2 1 44 . . . . . . . . 1 . . . 1 1 . . .   29 1 . . 1 5 . . . . . . . 1 .
## 3 1 33 . 1 . . . . . . . . . 1 . 1 . . .    2 1 1 . 1 5 . . . . . . . 1 .
## 4 1 47 1 . . . . . . . . . . 1 . . . 1 . 1506 1 . . 1 5 . . . . . . . 1 .
## 5 1 33 . . . . . . . . . . 1 . 1 . . 1 .    1 . . . 1 5 . . . . . . . 1 .
## 6 1 35 . . . 1 . . . . . . . 1 . . 1 . .  231 1 . . 1 5 . . . . . . . 1 .
##                       
## 1 . . 261 1 -1 . . . 1
## 2 . . 151 1 -1 . . . 1
## 3 . .  76 1 -1 . . . 1
## 4 . .  92 1 -1 . . . 1
## 5 . . 198 1 -1 . . . 1
## 6 . . 139 1 -1 . . . 1
One-hot encoding

Next, we transform the categorical variables into dummy variables. This is the one-hot encoding step.

The purpose is to turn each value of each categorical feature into a binary {0, 1} feature. With fullRank = T, each k-level factor becomes k-1 dummies, and the conversion yields 42 predictor variables plus the outcome y.
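For intuition, a sketch of dummyVars and the fullRank flag on a hypothetical one-factor data frame:

toy <- data.frame(f = factor(c("a", "b", "c", "a")))
predict(dummyVars(~ ., data = toy, fullRank = TRUE),  toy)  # k-1 columns: f.b, f.c
predict(dummyVars(~ ., data = toy, fullRank = FALSE), toy)  # k columns: f.a, f.b, f.c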

#==================================================================
#convert categorical variables  to numeric variables
#==================================================================


dmy <- dummyVars(" ~ .", data = Data,fullRank = T)

transformed <- data.frame(predict(dmy, newdata =Data))


#Checking the structure of transformed train file

str(transformed)
## 'data.frame':    45211 obs. of  43 variables:
##  $ age                : num  58 44 33 47 33 35 28 42 58 43 ...
##  $ job.blue.collar    : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ job.entrepreneur   : num  0 0 1 0 0 0 0 1 0 0 ...
##  $ job.housemaid      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job.management     : num  1 0 0 0 0 1 1 0 0 0 ...
##  $ job.retired        : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ job.self.employed  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job.services       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job.student        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job.technician     : num  0 1 0 0 0 0 0 0 0 1 ...
##  $ job.unemployed     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job.unknown        : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ marital.married    : num  1 0 1 1 0 1 0 0 1 0 ...
##  $ marital.single     : num  0 1 0 0 1 0 1 0 0 1 ...
##  $ education.secondary: num  0 1 1 0 0 0 0 0 0 1 ...
##  $ education.tertiary : num  1 0 0 0 0 1 1 1 0 0 ...
##  $ education.unknown  : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ default.yes        : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ balance            : num  2143 29 2 1506 1 ...
##  $ housing.yes        : num  1 1 1 1 0 1 1 1 1 1 ...
##  $ loan.yes           : num  0 0 1 0 0 0 1 0 0 0 ...
##  $ contact.telephone  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ contact.unknown    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ day                : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ month.aug          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.dec          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.feb          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.jan          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.jul          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.jun          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.mar          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.may          : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ month.nov          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.oct          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ month.sep          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ duration           : num  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays              : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome.other     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome.success   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome.unknown   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ y                  : num  0 0 0 0 0 0 0 0 0 0 ...
#==================================================================
#Converting the dependent variable back to categorical
#==================================================================

transformed$y<-as.factor(transformed$y)
#save(transformed,file="transformed.RData")
 
load("transformed.RData")
#==================================================================
#Spliting training set into two parts based on outcome: 70% and 30%
#==================================================================

index<-createDataPartition(transformed$y,p=0.70, list=FALSE)

trainSet<-transformed[index,]

testSet<-transformed[-index,]

outcomeName<-'y'

predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]

predictors
##  [1] "age"                "jobblue.collar"     "jobentrepreneur"   
##  [4] "jobhousemaid"       "jobmanagement"      "jobretired"        
##  [7] "jobself.employed"   "jobservices"        "jobstudent"        
## [10] "jobtechnician"      "jobunemployed"      "jobunknown"        
## [13] "maritalmarried"     "maritalsingle"      "educationsecondary"
## [16] "educationtertiary"  "educationunknown"   "defaultyes"        
## [19] "balance"            "housingyes"         "loanyes"           
## [22] "contacttelephone"   "contactunknown"     "day"               
## [25] "monthaug"           "monthdec"           "monthfeb"          
## [28] "monthjan"           "monthjul"           "monthjun"          
## [31] "monthmar"           "monthmay"           "monthnov"          
## [34] "monthoct"           "monthsep"           "duration"          
## [37] "campaign"           "pdays"              "previous"          
## [40] "poutcomeother"      "poutcomesuccess"    "poutcomeunknown"
dim(trainSet)
## [1] 31648    43
dim(testSet)
## [1] 13563    43

The training data has 31,648 rows and the test data 13,563 rows, each with 43 columns. (The partition is random, so the exact sizes can shift by a row between runs; see the reproducibility sketch below.) With 43 features, feature selection is necessary to reduce overfitting and the complexity of the models we are going to build.
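Because createDataPartition samples at random, fixing a seed makes the split reproducible (a sketch; the seed value is arbitrary):

set.seed(123)   # arbitrary fixed seed for a reproducible split
index    <- createDataPartition(transformed$y, p = 0.70, list = FALSE)
trainSet <- transformed[index, ]
testSet  <- transformed[-index, ]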

Recursive Feature Selection

#==================================================================
#Feature selection using rfe in caret(recursive feature extraction)
#predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
#Alternatively
#predictors<-setdiff(names(trainSet),outcomeName)
#==================================================================

library(randomForest)

# control <- rfeControl(functions = rfFuncs,
#                       method = "repeatedcv",
#                       repeats = 3,
#                       verbose = FALSE,
#                       allowParallel = TRUE)
# outcomeName<-'y'
# 
# predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
# 
# feature_select <- rfe(trainSet[,predictors], trainSet[,outcomeName],
#                          rfeControl = control)
#save(feature_select,file="feature_select.RData")
 
load("feature_select.RData")


table(trainSet$y)
## 
##     0     1 
## 27935  3713
feature_select
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 3 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          4   0.9012 0.3913   0.003364 0.02533         
##          8   0.9000 0.3538   0.003068 0.02860         
##         16   0.9048 0.4450   0.003179 0.01995        *
##         42   0.9047 0.4318   0.004349 0.02868         
## 
## The top 5 variables (out of 16):
##    duration, poutcomesuccess, monthmar, contactunknown, housingyes
names(feature_select)
##  [1] "pred"         "variables"    "results"      "bestSubset"  
##  [5] "fit"          "optVariables" "optsize"      "call"        
##  [9] "control"      "resample"     "metric"       "maximize"    
## [13] "perfNames"    "times"        "resampledCM"  "obsLevels"   
## [17] "dots"
predictors(feature_select)
##  [1] "duration"        "poutcomesuccess" "monthmar"       
##  [4] "contactunknown"  "housingyes"      "age"            
##  [7] "monthoct"        "day"             "monthsep"       
## [10] "pdays"           "monthjul"        "monthaug"       
## [13] "monthmay"        "monthjun"        "monthdec"       
## [16] "campaign"
summary(feature_select)
##              Length Class        Mode     
## pred          0     -none-       NULL     
## variables     6     data.frame   list     
## results       5     data.frame   list     
## bestSubset    1     -none-       numeric  
## fit          18     randomForest list     
## optVariables 16     -none-       character
## optsize       1     -none-       numeric  
## call          4     -none-       call     
## control      14     -none-       list     
## resample      8     data.frame   list     
## metric        1     -none-       character
## maximize      1     -none-       logical  
## perfNames     2     -none-       character
## times         3     -none-       list     
## resampledCM   0     -none-       NULL     
## obsLevels     2     -none-       character
## dots          0     -none-       list
feature_select[ "bestSubset"]
## $bestSubset
## [1] 16
#The top 5 variables (out of 42):
#print("The top 5 variables (out of 42)\n")
cat("The top 5 variables (out of 42)\n")
## The top 5 variables (out of 42)
cat("duration, poutcomesuccess, monthmar, contactunknown, housingyes\n")   
## duration, poutcomesuccess, monthmar, contactunknown, housingyes
predictors(feature_select)
##  [1] "duration"        "poutcomesuccess" "monthmar"       
##  [4] "contactunknown"  "housingyes"      "age"            
##  [7] "monthoct"        "day"             "monthsep"       
## [10] "pdays"           "monthjul"        "monthaug"       
## [13] "monthmay"        "monthjun"        "monthdec"       
## [16] "campaign"
#===================================================================================
# plot variable selection
#===================================================================================


trellis.par.set(caretTheme())

plot(feature_select, type = c( "o","g"))

A model with just the top few variables already achieves an accuracy of about 90%; the selected 16-variable subset adds less than half a percentage point over 4 variables, and using all 42 adds nothing further. This is the advantage of feature selection: it reduces model complexity, overfitting, and computation time. The stored resampling results make this trade-off explicit (sketch below).
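The accuracy-by-subset-size results can be read directly off the rfe object; feature_select$results is the data frame summarized above (a sketch):

feature_select$results[, c("Variables", "Accuracy", "Kappa")]
#   Variables Accuracy  Kappa
# 1         4   0.9012 0.3913
# 2         8   0.9000 0.3538
# 3        16   0.9048 0.4450
# 4        42   0.9047 0.4318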

#===================================================================================
# Keep only the top predictors selected by rfe:
# duration, poutcomesuccess, monthmar, contactunknown, housingyes, ...
# (see predictors(feature_select) above)
# The Cs function from Hmisc converts bare names to a character vector
#===================================================================================



library(Hmisc)

p <- paste(predictors(feature_select), collapse = ", ")

#trainSet[,]

predictor=Hmisc::Cs(duration, poutcomesuccess, monthmar, contactunknown, age, housingyes, day, monthoct, pdays, monthsep, monthjul, monthdec, monthmay, monthaug, campaign, monthjun)




print("The set of features selected is:\n") 
## [1] "The set of features selected is:\n"
p
## [1] "duration,poutcomesuccess,monthmar,contactunknown,housingyes,age,monthoct,day,monthsep,pdays,monthjul,monthaug,monthmay,monthjun,monthdec,campaign,"
#trainSet[,outcomeName]%>%head()

#trainSet[,outcomeName]

Topfivepred = Hmisc::Cs(duration, poutcomesuccess, monthmar, contactunknown, housingyes) 

names(trainSet)
##  [1] "age"                "jobblue.collar"     "jobentrepreneur"   
##  [4] "jobhousemaid"       "jobmanagement"      "jobretired"        
##  [7] "jobself.employed"   "jobservices"        "jobstudent"        
## [10] "jobtechnician"      "jobunemployed"      "jobunknown"        
## [13] "maritalmarried"     "maritalsingle"      "educationsecondary"
## [16] "educationtertiary"  "educationunknown"   "defaultyes"        
## [19] "balance"            "housingyes"         "loanyes"           
## [22] "contacttelephone"   "contactunknown"     "day"               
## [25] "monthaug"           "monthdec"           "monthfeb"          
## [28] "monthjan"           "monthjul"           "monthjun"          
## [31] "monthmar"           "monthmay"           "monthnov"          
## [34] "monthoct"           "monthsep"           "duration"          
## [37] "campaign"           "pdays"              "previous"          
## [40] "poutcomeother"      "poutcomesuccess"    "poutcomeunknown"   
## [43] "y"

Logistic Regression Model

# trainSet$y<-as.factor(trainSet$y)
# 
# # Set up training control
# ctrl <- trainControl(method = "repeatedcv",          # repeated cross-validation
#                      number = 5,                     # 5-fold CV
#                      summaryFunction=twoClassSummary,    
#                      classProbs=TRUE,
#                      allowParallel = TRUE
#                      ) # Use AUC (ROC) to pick the best model instead of the default "Accuracy"
# 
# # This fixes the factor-level names problem when classProbs=TRUE
# trainSetglm<-if_else(trainSet[,outcomeName]==0,"No","Yes")
# 
# model_glm<-train(trainSet[,Topfivepred],trainSetglm,method='glm',family="binomial",trControl=ctrl, 
#                  metric="ROC")
# 
# model_glm2<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method='glm',family="binomial")

The accuracy of the logistic regression model on the test set is about 90%.

#save(model_glm,file="model_glm.RData")
 
load("model_glm.RData")

#save(model_glm2,file="model_glm2.RData")
 
load("model_glm2.RData")
 summary(model_glm2)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.6320  -0.4069  -0.2862  -0.1763   3.1529  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.989e+00  3.847e-02  -77.69   <2e-16 ***
## duration         3.994e-03  7.338e-05   54.42   <2e-16 ***
## poutcomesuccess  2.617e+00  7.374e-02   35.49   <2e-16 ***
## monthmar         2.175e+00  1.258e-01   17.29   <2e-16 ***
## contactunknown  -1.301e+00  6.729e-02  -19.34   <2e-16 ***
## housingyes      -7.931e-01  4.373e-02  -18.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22845  on 31648  degrees of freedom
## Residual deviance: 16233  on 31643  degrees of freedom
## AIC: 16245
## 
## Number of Fisher Scoring iterations: 6
# Predict using the test data
pred<-predict(model_glm,testSet)

my_data=data.frame(cbind(predicted=pred,observed=testSet$y))

ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('logistic model')

# Print, plot variable importance
print(varImp(model_glm, scale = FALSE))
## glm variable importance
## 
##                 Overall
## duration          54.42
## poutcomesuccess   35.49
## contactunknown    19.34
## housingyes        18.14
## monthmar          17.29
plot(varImp(model_glm, scale = FALSE), main="Variable Importance using logistic/glm")

#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")

confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  11716   271
##        Yes  1061   515
##                                           
##                Accuracy : 0.9018          
##                  95% CI : (0.8967, 0.9068)
##     No Information Rate : 0.942           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3888          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9170          
##             Specificity : 0.6552          
##          Pos Pred Value : 0.9774          
##          Neg Pred Value : 0.3268          
##              Prevalence : 0.9420          
##          Detection Rate : 0.8638          
##    Detection Prevalence : 0.8838          
##       Balanced Accuracy : 0.7861          
##                                           
##        'Positive' Class : No              
## 
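A caveat on the confusionMatrix() call above: caret's confusionMatrix(data, reference) expects the predictions as the first argument and the observed classes as the second, so passing testSetglm first swaps the roles of sensitivity and specificity. A sketch of the conventional order:

# predictions first, observed classes second
confusionMatrix(data = factor(pred, levels = c("No", "Yes")),
                reference = factor(testSetglm, levels = c("No", "Yes")))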

We fit multiple algorithms at this step and compare their performance. The four models fit in the loop below are boosted logistic regression (LogitBoost), extreme gradient boosting (xgbTree), random forest (rf), and generalized partial least squares (gpls); a support vector machine with a radial basis kernel is fit separately afterwards.

#===================================================================================
# multiple algorithms
#===================================================================================




# Set up training control
fitControl <- trainControl(method = "repeatedcv",      # repeated cross-validation
                     number = 5,                       # 5-fold CV
                     summaryFunction=twoClassSummary,  # use AUC (ROC) to pick the best model
                     classProbs=TRUE,
                     allowParallel = TRUE)
# This fixes the factor-level names problem when classProbs=TRUE
trainSetglm<-if_else(trainSet[,outcomeName]==0,"No","Yes") 

# 
# met=c("LogitBoost", 'xgbTree', 'rf', 'gpls')
# 
# 
# 
# re=list()
# 
# for (i in seq_along(met)) {
#   
#  re[[i]]<-train(trainSet[,Topfivepred],trainSetglm,method=met[i],
#                 
#                     preProcess = c("center","scale"),trControl=fitControl,metric = "ROC")  
# }
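Since caretEnsemble is already loaded, the commented loop above can equivalently be written as a single caretList call (a sketch, assuming the same trainSet, Topfivepred, trainSetglm, and fitControl objects):

# models <- caretList(x = trainSet[,Topfivepred],
#                     y = factor(trainSetglm),
#                     trControl = fitControl,
#                     methodList = c("LogitBoost", "xgbTree", "rf", "gpls"),
#                     metric = "ROC")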

Save the fitted models so the expensive training step does not have to be re-run every time.
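A minimal sketch of a guard that refits only when no cached file exists, assuming the file name used below:

if (file.exists("multiplealgorithms.RData")) {
  load("multiplealgorithms.RData")
} else {
  # ... run the training loop above ...
  save(re, file = "multiplealgorithms.RData")
}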

#save(re,file="multiplealgorithms.RData")
 
load("multiplealgorithms.RData")
 
 
#re[[1]]
#re[[2]]
#re[[3]]
#re[[4]]

#re%>%as.list.data.frame()%>%kable()

Boosted Logistic Regression

re[[1]]%>%head()
## $method
## [1] "LogitBoost"
## 
## $modelInfo
## $modelInfo$label
## [1] "Boosted Logistic Regression"
## 
## $modelInfo$library
## [1] "caTools"
## 
## $modelInfo$loop
## function (grid) 
## {
##     loop <- grid[which.max(grid$nIter), , drop = FALSE]
##     submodels <- grid[-which.max(grid$nIter), , drop = FALSE]
##     submodels <- list(submodels)
##     list(loop = loop, submodels = submodels)
## }
## 
## $modelInfo$type
## [1] "Classification"
## 
## $modelInfo$parameters
##   parameter   class                 label
## 1     nIter numeric # Boosting Iterations
## 
## $modelInfo$grid
## function (x, y, len = NULL, search = "grid") 
## {
##     if (search == "grid") {
##         out <- data.frame(nIter = 1 + ((1:len) * 10))
##     }
##     else {
##         out <- data.frame(nIter = unique(sample(1:100, size = len, 
##             replace = TRUE)))
##     }
##     out
## }
## 
## $modelInfo$fit
## function (x, y, wts, param, lev, last, classProbs, ...) 
## {
##     caTools::LogitBoost(as.matrix(x), y, nIter = param$nIter)
## }
## 
## $modelInfo$predict
## function (modelFit, newdata, submodels = NULL) 
## {
##     out <- caTools::predict.LogitBoost(modelFit, newdata, type = "class")
##     if (!is.null(submodels)) {
##         tmp <- out
##         out <- vector(mode = "list", length = nrow(submodels) + 
##             1)
##         out[[1]] <- tmp
##         for (j in seq(along = submodels$nIter)) {
##             out[[j + 1]] <- caTools::predict.LogitBoost(modelFit, 
##                 newdata, nIter = submodels$nIter[j])
##         }
##     }
##     out
## }
## 
## $modelInfo$prob
## function (modelFit, newdata, submodels = NULL) 
## {
##     out <- caTools::predict.LogitBoost(modelFit, newdata, type = "raw")
##     out <- t(apply(out, 1, function(x) x/sum(x)))
##     if (!is.null(submodels)) {
##         tmp <- vector(mode = "list", length = nrow(submodels) + 
##             1)
##         tmp[[1]] <- out
##         for (j in seq(along = submodels$nIter)) {
##             tmpProb <- caTools::predict.LogitBoost(modelFit, 
##                 newdata, type = "raw", nIter = submodels$nIter[j])
##             tmpProb <- out <- t(apply(tmpProb, 1, function(x) x/sum(x)))
##             tmp[[j + 1]] <- as.data.frame(tmpProb[, modelFit$obsLevels, 
##                 drop = FALSE])
##         }
##         out <- tmp
##     }
##     out
## }
## 
## $modelInfo$predictors
## function (x, ...) 
## {
##     if (!is.null(x$xNames)) {
##         out <- unique(x$xNames[x$Stump[, "feature"]])
##     }
##     else out <- NA
##     out
## }
## 
## $modelInfo$levels
## function (x) 
## x$obsLevels
## 
## $modelInfo$tags
## [1] "Ensemble Model"             "Boosting"                  
## [3] "Implicit Feature Selection" "Tree-Based Model"          
## [5] "Logistic Regression"       
## 
## $modelInfo$sort
## function (x) 
## x[order(x[, 1]), ]
## 
## 
## $modelType
## [1] "Classification"
## 
## $results
##   nIter       ROC      Sens      Spec       ROCSD      SensSD     SpecSD
## 1    11 0.8321840 0.9721247 0.3243221 0.005703393 0.009006722 0.06403445
## 2    21 0.8437131 0.9721609 0.3153970 0.018153601 0.009202690 0.05757145
## 3    31 0.8439240 0.9619261 0.3665091 0.010009442 0.019781419 0.09638093
## 
## $pred
## NULL
## 
## $bestTune
##   nIter
## 3    31
re[[1]][[4]]%>%kable()
| nIter |       ROC |      Sens |      Spec |     ROCSD |    SensSD |    SpecSD |
|------:|----------:|----------:|----------:|----------:|----------:|----------:|
|    11 | 0.8321840 | 0.9721247 | 0.3243221 | 0.0057034 | 0.0090067 | 0.0640345 |
|    21 | 0.8437131 | 0.9721609 | 0.3153970 | 0.0181536 | 0.0092027 | 0.0575714 |
|    31 | 0.8439240 | 0.9619261 | 0.3665091 | 0.0100094 | 0.0197814 | 0.0963809 |
plot(re[[1]])

varImp(object=re[[1]])
## ROC curve variable importance
## 
##                 Importance
## duration            100.00
## housingyes           31.53
## contactunknown       30.61
## poutcomesuccess      22.39
## monthmar              0.00
# Predict using the test data
pred<-predict(re[[1]],testSet)

my_data=data.frame(cbind(predicted=pred,observed=testSet$y))

ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Boosted Logistic Regression')

# Print, plot variable importance
print(varImp(re[[1]], scale = FALSE))
## ROC curve variable importance
## 
##                 Importance
## duration            0.8045
## housingyes          0.6103
## contactunknown      0.6077
## poutcomesuccess     0.5844
## monthmar            0.5209
plot(varImp(re[[1]], scale = FALSE), main="Boosted Logistic Regression")

#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")

confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  11407   580
##        Yes  1573     3
##                                          
##                Accuracy : 0.8413         
##                  95% CI : (0.835, 0.8474)
##     No Information Rate : 0.957          
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : -0.064         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.878814       
##             Specificity : 0.005146       
##          Pos Pred Value : 0.951614       
##          Neg Pred Value : 0.001904       
##              Prevalence : 0.957015       
##          Detection Rate : 0.841038       
##    Detection Prevalence : 0.883802       
##       Balanced Accuracy : 0.441980       
##                                          
##        'Positive' Class : No             
## 
#other model plot
trellis.par.set(caretTheme())

#plot(re[[1]],plotType = "level",scales=list(x=list(rot=90)))

Extreme Gradient Boosting

#re[[2]]

re[[2]][[4]]%>%head()%>%kable()
|    | eta | max_depth | gamma | colsample_bytree | min_child_weight | subsample | nrounds |       ROC |      Sens |      Spec |     ROCSD |    SensSD |    SpecSD |
|---:|----:|----------:|------:|-----------------:|-----------------:|----------:|--------:|----------:|----------:|----------:|----------:|----------:|----------:|
|  1 | 0.3 |         1 |     0 |              0.6 |                1 |      0.50 |      50 | 0.8543912 | 0.9743433 | 0.3424142 | 0.0113690 | 0.0028483 | 0.0352174 |
|  4 | 0.3 |         1 |     0 |              0.6 |                1 |      0.75 |      50 | 0.8544852 | 0.9755958 | 0.3329675 | 0.0106884 | 0.0013653 | 0.0137451 |
|  7 | 0.3 |         1 |     0 |              0.6 |                1 |      1.00 |      50 | 0.8543956 | 0.9767766 | 0.3235208 | 0.0110311 | 0.0012870 | 0.0177499 |
| 10 | 0.3 |         1 |     0 |              0.8 |                1 |      0.50 |      50 | 0.8764375 | 0.9745581 | 0.3307991 | 0.0103923 | 0.0044108 | 0.0282519 |
| 13 | 0.3 |         1 |     0 |              0.8 |                1 |      0.75 |      50 | 0.8762603 | 0.9783152 | 0.3181158 | 0.0105884 | 0.0025703 | 0.0217097 |
| 16 | 0.3 |         1 |     0 |              0.8 |                1 |      1.00 |      50 | 0.8763053 | 0.9777784 | 0.3202794 | 0.0105829 | 0.0028286 | 0.0246302 |
plot(re[[2]])

varImp(object=re[[2]])
## xgbTree variable importance
## 
##                 Overall
## duration        100.000
## poutcomesuccess  33.837
## contactunknown   21.513
## monthmar          7.581
## housingyes        0.000
# Predict using the test data
pred<-predict(re[[2]],testSet)

my_data=data.frame(cbind(predicted=pred,observed=testSet$y))

ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Extreme Gradient Boosting')

# Print, plot variable importance
print(varImp(re[[2]], scale = FALSE))
## xgbTree variable importance
## 
##                 Overall
## duration        0.61376
## poutcomesuccess 0.20768
## contactunknown  0.13204
## monthmar        0.04653
## housingyes      0.00000
plot(varImp(re[[2]], scale = FALSE), main="Extreme Gradient Boosting")

#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")

confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  11971    16
##        Yes  1565    11
##                                           
##                Accuracy : 0.8834          
##                  95% CI : (0.8779, 0.8888)
##     No Information Rate : 0.998           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0098          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.88438         
##             Specificity : 0.40741         
##          Pos Pred Value : 0.99867         
##          Neg Pred Value : 0.00698         
##              Prevalence : 0.99801         
##          Detection Rate : 0.88262         
##    Detection Prevalence : 0.88380         
##       Balanced Accuracy : 0.64589         
##                                           
##        'Positive' Class : No              
## 
#other model plot
trellis.par.set(caretTheme())

#plot(re[[2]],plotType = "level",scales=list(x=list(rot=90)))

Random Forest

re[[3]]
## Random Forest 
## 
## 31649 samples
##     5 predictor
##     2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (5), scaled (5) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 25319, 25319, 25319, 25319, 25320 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   2     0.7898092  0.9777786  0.3197410
##   3     0.8065867  0.9742718  0.3483627
##   5     0.8175060  0.9582409  0.3291888
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
re[[3]][[4]]%>%kable()
| mtry |       ROC |      Sens |      Spec |     ROCSD |    SensSD |    SpecSD |
|-----:|----------:|----------:|----------:|----------:|----------:|----------:|
|    2 | 0.7898092 | 0.9777786 | 0.3197410 | 0.0059779 | 0.0017496 | 0.0067902 |
|    3 | 0.8065867 | 0.9742718 | 0.3483627 | 0.0021690 | 0.0031719 | 0.0194566 |
|    5 | 0.8175060 | 0.9582409 | 0.3291888 | 0.0083053 | 0.0016630 | 0.0133612 |
plot(re[[3]])

varImp(object=re[[3]])
## rf variable importance
## 
##                  Overall
## duration        100.0000
## poutcomesuccess  17.3343
## housingyes        3.5630
## contactunknown    0.3452
## monthmar          0.0000
# Predict using the test data
pred<-predict(re[[3]],testSet)

my_data=data.frame(cbind(predicted=pred,observed=testSet$y))

ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Random Forest')

# Print, plot variable importance
print(varImp(re[[3]], scale = FALSE))
## rf variable importance
## 
##                 Overall
## duration         2803.2
## poutcomesuccess   577.7
## housingyes        206.9
## contactunknown    120.3
## monthmar          111.0
plot(varImp(re[[3]], scale = FALSE), main="Random Forest")

#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")

confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  11722   265
##        Yes   843   733
##                                           
##                Accuracy : 0.9183          
##                  95% CI : (0.9136, 0.9229)
##     No Information Rate : 0.9264          
##     P-Value [Acc > NIR] : 0.9998          
##                                           
##                   Kappa : 0.5269          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9329          
##             Specificity : 0.7345          
##          Pos Pred Value : 0.9779          
##          Neg Pred Value : 0.4651          
##              Prevalence : 0.9264          
##          Detection Rate : 0.8643          
##    Detection Prevalence : 0.8838          
##       Balanced Accuracy : 0.8337          
##                                           
##        'Positive' Class : No              
## 
#other model plot
trellis.par.set(caretTheme())

#plot(re[[3]],plotType = "level",scales=list(x=list(rot=90)))

Support Vector Machines with Radial Basis Function Kernel

# Set up training control
fitControl <- trainControl(method = "repeatedcv",      # repeated cross-validation
                     number = 5,                       # 5-fold CV
                     summaryFunction=twoClassSummary,  # use AUC (ROC) to pick the best model
                     classProbs=TRUE,
                     allowParallel = TRUE)
# This fixes the factor-level names problem when classProbs=TRUE
trainSetglm<-if_else(trainSet[,outcomeName]==0,"No","Yes") 

#model_svm<-train(trainSet[,Topfivepred],trainSetglm,method='svmRadial',
#                  trControl=fitControl,preProcess = c("center","scale"))
#
#predictions<-predict.train(object=re[[4]],testSet[,predictors],type="raw")

#predictions
#save(model_svm,file="model_svm.RData")
 
load("model_svm.RData")

model_svm
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 31649 samples
##     5 predictor
##     2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (5), scaled (5) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 25320, 25320, 25319, 25318, 25319 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens       Spec     
##   0.25  0.6232691  0.9791866  0.3190709
##   0.50  0.7684297  0.9778859  0.3183791
##   1.00  0.7668203  0.9779217  0.3154112
## 
## Tuning parameter 'sigma' was held constant at a value of 3.715887
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 3.715887 and C = 0.5.
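Caret's default grid held sigma constant at a kernlab-estimated value and tuned only C; a custom grid could search both (a sketch; the grid values are illustrative):

svmGrid <- expand.grid(sigma = c(0.01, 0.1, 1), C = c(0.25, 0.5, 1, 2))
# model_svm2 <- train(trainSet[,Topfivepred], trainSetglm, method = 'svmRadial',
#                     trControl = fitControl, preProcess = c("center","scale"),
#                     tuneGrid = svmGrid, metric = "ROC")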
ggplot(model_svm)

#other model plot
trellis.par.set(caretTheme())

#plot(model_svm,plotType = "level",scales=list(x=list(rot=90)))


varImp(object=model_svm)
## ROC curve variable importance
## 
##                 Importance
## duration            100.00
## housingyes           31.53
## contactunknown       30.61
## poutcomesuccess      22.39
## monthmar              0.00
# Predict using the test data
#pred<-predict(model_svm,testSet)

#predictions<-predict(object=model_svm,testSet[,Topfivepred],type="raw")

#predictions<-caret::predict.train(object=model_svm,testSet[,Topfivepred],type="raw")

#my_data=data.frame(cbind(predicted=predictions,observed=testSet$y))

#ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Support Vector Machines with Radial Basis Function Kernel')

# Print, plot variable importance
print(varImp(model_svm, scale = FALSE))
## ROC curve variable importance
## 
##                 Importance
## duration            0.8045
## housingyes          0.6103
## contactunknown      0.6077
## poutcomesuccess     0.5844
## monthmar            0.5209
plot(varImp(model_svm, scale = FALSE), main="Support Vector Machines with Radial Basis Function Kernel")

#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")

#confusionMatrix(testSetglm,predictions)



#predictions%>%head()


#other model plot
trellis.par.set(caretTheme())

#plot(re[[3]],plotType = "level",scales=list(x=list(rot=90)))

Find the models that are supported by the caret package; at last count there are over 200 models that can be fit through caret.

names(getModelInfo())%>%head()
## [1] "AdaBag"      "AdaBoost.M1" "adaboost"    "ada"         "amdai"      
## [6] "ANFIS"
#===================================================================================
#gradient boosted trees
# parameter tuning
#===================================================================================


fitControl <- trainControl(method = "repeatedcv",number = 5,repeats = 5,
                           allowParallel = TRUE )

# repeats = 5: five repetitions of 5-fold cross-validation (number = 5)


#fitControl <- trainControl(
#  method = "repeatedcv",
#  number = 5,classProbs = TRUE,
#  repeats = 5,allowParallel = TRUE)

#Creating grid (an earlier, larger draft; see the note after the grid below)
#grid <- expand.grid(n.trees=c(10,20,50,100,500,1000),  # number of trees to fit
#                    shrinkage=c(0.01,0.05,0.1,0.5),    # learning rate
#                    n.minobsinnode=c(3,5,10),
#                    interaction.depth=c(1,5,10))       # depth of variable interactions
grid <- expand.grid(interaction.depth=c(1,5,10),    # depth of variable interactions
                    n.trees=c(10,20,50,100),        # number of trees to fit
                    shrinkage=c(0.01,0.05,0.1,0.5), # four values for the learning rate
                    n.minobsinnode=c(3,5,10))       # minimum observations per terminal node
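
expand.grid crosses every supplied value, so this grid holds 3 x 4 x 4 x 3 = 144 candidate parameter combinations:

nrow(grid)
## [1] 144

Note that the saved model_gbm printed below reports n.trees up to 1000, so it appears to have been fitted with the larger commented-out draft grid rather than this one.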
                    
                    
# training the model
#model_gbm<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method='gbm',trControl=fitControl,
#                 tuneGrid=grid,preProcess = c("center","scale"))
# summarizing the model
modelLookup(model='gbm')
#model_gbm%>%as.list.data.frame()%>%kable()
#save(model_gbm,file="model_gbm.RData")
 
load("model_gbm.RData")

print(model_gbm)
## Stochastic Gradient Boosting 
## 
## 31649 samples
##    16 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 25320, 25320, 25318, 25319, 25319, 25319, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.minobsinnode  n.trees  Accuracy 
##   0.01        1                  3                10     0.8829979
##   0.01        1                  3                20     0.8829979
##   0.01        1                  3                50     0.8829979
##   0.01        1                  3               100     0.8848242
##   0.01        1                  3               500     0.8934753
##   0.01        1                  3              1000     0.8984422
##   0.01        1                  5                10     0.8829979
##   0.01        1                  5                20     0.8829979
##   0.01        1                  5                50     0.8829979
##   0.01        1                  5               100     0.8847926
##   0.01        1                  5               500     0.8931783
##   0.01        1                  5              1000     0.8982463
##   0.01        1                 10                10     0.8829979
##   0.01        1                 10                20     0.8829979
##   0.01        1                 10                50     0.8829979
##   0.01        1                 10               100     0.8847736
##   0.01        1                 10               500     0.8931846
##   0.01        1                 10              1000     0.8982969
##   0.01        5                  3                10     0.8829979
##   0.01        5                  3                20     0.8829979
##   0.01        5                  3                50     0.8829979
##   0.01        5                  3               100     0.8839395
##   0.01        5                  3               500     0.9031945
##   0.01        5                  3              1000     0.9042624
##   0.01        5                  5                10     0.8829979
##   0.01        5                  5                20     0.8829979
##   0.01        5                  5                50     0.8829979
##   0.01        5                  5               100     0.8839269
##   0.01        5                  5               500     0.9032956
##   0.01        5                  5              1000     0.9042372
##   0.01        5                 10                10     0.8829979
##   0.01        5                 10                20     0.8829979
##   0.01        5                 10                50     0.8829979
##   0.01        5                 10               100     0.8839647
##   0.01        5                 10               500     0.9032324
##   0.01        5                 10              1000     0.9045215
##   0.01       10                  3                10     0.8829979
##   0.01       10                  3                20     0.8829979
##   0.01       10                  3                50     0.8829979
##   0.01       10                  3               100     0.8894309
##   0.01       10                  3               500     0.9051219
##   0.01       10                  3              1000     0.9055390
##   0.01       10                  5                10     0.8829979
##   0.01       10                  5                20     0.8829979
##   0.01       10                  5                50     0.8829979
##   0.01       10                  5               100     0.8891530
##   0.01       10                  5               500     0.9051914
##   0.01       10                  5              1000     0.9056780
##   0.01       10                 10                10     0.8829979
##   0.01       10                 10                20     0.8829979
##   0.01       10                 10                50     0.8829979
##   0.01       10                 10               100     0.8893614
##   0.01       10                 10               500     0.9052546
##   0.01       10                 10              1000     0.9059245
##   0.05        1                  3                10     0.8829979
##   0.05        1                  3                20     0.8849885
##   0.05        1                  3                50     0.8880660
##   0.05        1                  3               100     0.8936269
##   0.05        1                  3               500     0.9009384
##   0.05        1                  3              1000     0.9020001
##   0.05        1                  5                10     0.8829979
##   0.05        1                  5                20     0.8849695
##   0.05        1                  5                50     0.8879965
##   0.05        1                  5               100     0.8939429
##   0.05        1                  5               500     0.9007804
##   0.05        1                  5              1000     0.9021897
##   0.05        1                 10                10     0.8829979
##   0.05        1                 10                20     0.8850517
##   0.05        1                 10                50     0.8879712
##   0.05        1                 10               100     0.8934943
##   0.05        1                 10               500     0.9007804
##   0.05        1                 10              1000     0.9019495
##   0.05        5                  3                10     0.8829979
##   0.05        5                  3                20     0.8854181
##   0.05        5                  3                50     0.9005530
##   0.05        5                  3               100     0.9032576
##   0.05        5                  3               500     0.9049134
##   0.05        5                  3              1000     0.9048060
##   0.05        5                  5                10     0.8829979
##   0.05        5                  5                20     0.8852918
##   0.05        5                  5                50     0.9009195
##   0.05        5                  5               100     0.9030554
##   0.05        5                  5               500     0.9054442
##   0.05        5                  5              1000     0.9055200
##   0.05        5                 10                10     0.8829979
##   0.05        5                 10                20     0.8848116
##   0.05        5                 10                50     0.9007299
##   0.05        5                 10               100     0.9028722
##   0.05        5                 10               500     0.9049892
##   0.05        5                 10              1000     0.9048439
##   0.05       10                  3                10     0.8829979
##   0.05       10                  3                20     0.8904736
##   0.05       10                  3                50     0.9027395
##   0.05       10                  3               100     0.9045216
##   0.05       10                  3               500     0.9056465
##   0.05       10                  3              1000     0.9045279
##   0.05       10                  5                10     0.8829979
##   0.05       10                  5                20     0.8916996
##   0.05       10                  5                50     0.9022655
##   0.05       10                  5               100     0.9048249
##   0.05       10                  5               500     0.9055832
##   0.05       10                  5              1000     0.9042309
##   0.05       10                 10                10     0.8829979
##   0.05       10                 10                20     0.8912509
##   0.05       10                 10                50     0.9025941
##   0.05       10                 10               100     0.9046669
##   0.05       10                 10               500     0.9056464
##   0.05       10                 10              1000     0.9047112
##   0.10        1                  3                10     0.8854056
##   0.10        1                  3                20     0.8874214
##   0.10        1                  3                50     0.8941199
##   0.10        1                  3               100     0.8984107
##   0.10        1                  3               500     0.9020633
##   0.10        1                  3              1000     0.9020443
##   0.10        1                  5                10     0.8850770
##   0.10        1                  5                20     0.8872824
##   0.10        1                  5                50     0.8941009
##   0.10        1                  5               100     0.8987709
##   0.10        1                  5               500     0.9020001
##   0.10        1                  5              1000     0.9024614
##   0.10        1                 10                10     0.8853550
##   0.10        1                 10                20     0.8873330
##   0.10        1                 10                50     0.8940566
##   0.10        1                 10               100     0.8986066
##   0.10        1                 10               500     0.9018674
##   0.10        1                 10              1000     0.9021139
##   0.10        5                  3                10     0.8875099
##   0.10        5                  3                20     0.8986066
##   0.10        5                  3                50     0.9026320
##   0.10        5                  3               100     0.9039591
##   0.10        5                  3               500     0.9047364
##   0.10        5                  3              1000     0.9040160
##   0.10        5                  5                10     0.8867073
##   0.10        5                  5                20     0.8986951
##   0.10        5                  5                50     0.9028469
##   0.10        5                  5               100     0.9042245
##   0.10        5                  5               500     0.9049323
##   0.10        5                  5              1000     0.9037190
##   0.10        5                 10                10     0.8875098
##   0.10        5                 10                20     0.8979999
##   0.10        5                 10                50     0.9031123
##   0.10        5                 10               100     0.9040728
##   0.10        5                 10               500     0.9048754
##   0.10        5                 10              1000     0.9043572
##   0.10       10                  3                10     0.8925590
##   0.10       10                  3                20     0.9012102
##   0.10       10                  3                50     0.9048123
##   0.10       10                  3               100     0.9052546
##   0.10       10                  3               500     0.9035611
##   0.10       10                  3              1000     0.9017663
##   0.10       10                  5                10     0.8929381
##   0.10       10                  5                20     0.9013681
##   0.10       10                  5                50     0.9047743
##   0.10       10                  5               100     0.9053431
##   0.10       10                  5               500     0.9038201
##   0.10       10                  5              1000     0.9018927
##   0.10       10                 10                10     0.8928560
##   0.10       10                 10                20     0.9011028
##   0.10       10                 10                50     0.9039907
##   0.10       10                 10               100     0.9050966
##   0.10       10                 10               500     0.9037506
##   0.10       10                 10              1000     0.9020886
##   0.50        1                  3                10     0.8967866
##   0.50        1                  3                20     0.8991690
##   0.50        1                  3                50     0.9008057
##   0.50        1                  3               100     0.9015893
##   0.50        1                  3               500     0.9020317
##   0.50        1                  3              1000     0.9017536
##   0.50        1                  5                10     0.8962305
##   0.50        1                  5                20     0.8993459
##   0.50        1                  5                50     0.9005782
##   0.50        1                  5               100     0.9019116
##   0.50        1                  5               500     0.9020569
##   0.50        1                  5              1000     0.9018042
##   0.50        1                 10                10     0.8968182
##   0.50        1                 10                20     0.8996050
##   0.50        1                 10                50     0.9008563
##   0.50        1                 10               100     0.9016272
##   0.50        1                 10               500     0.9016778
##   0.50        1                 10              1000     0.9019243
##   0.50        5                  3                10     0.9005720
##   0.50        5                  3                20     0.9016715
##   0.50        5                  3                50     0.9016399
##   0.50        5                  3               100     0.9011028
##   0.50        5                  3               500     0.8955165
##   0.50        5                  3              1000     0.8935006
##   0.50        5                  5                10     0.9007678
##   0.50        5                  5                20     0.9020317
##   0.50        5                  5                50     0.9013050
##   0.50        5                  5               100     0.9007615
##   0.50        5                  5               500     0.8957882
##   0.50        5                  5              1000     0.8922620
##   0.50        5                 10                10     0.9009890
##   0.50        5                 10                20     0.9015704
##   0.50        5                 10                50     0.9022719
##   0.50        5                 10               100     0.9018611
##   0.50        5                 10               500     0.8978357
##   0.50        5                 10              1000     0.8947202
##   0.50       10                  3                10     0.9019496
##   0.50       10                  3                20     0.9017032
##   0.50       10                  3                50     0.8990744
##   0.50       10                  3               100     0.8976525
##   0.50       10                  3               500     0.8911499
##   0.50       10                  3              1000     0.8903411
##   0.50       10                  5                10     0.9016716
##   0.50       10                  5                20     0.9011596
##   0.50       10                  5                50     0.9004139
##   0.50       10                  5               100     0.8984234
##   0.50       10                  5               500     0.8906506
##   0.50       10                  5              1000     0.8879459
##   0.50       10                 10                10     0.9008184
##   0.50       10                 10                20     0.9005025
##   0.50       10                 10                50     0.9001107
##   0.50       10                 10               100     0.8987583
##   0.50       10                 10               500     0.8928433
##   0.50       10                 10              1000     0.8903283
##   Kappa     
##   0.00000000
##   0.00000000
##   0.00000000
##   0.03330150
##   0.20288362
##   0.32936482
##   0.00000000
##   0.00000000
##   0.00000000
##   0.03315239
##   0.19728622
##   0.32813486
##   0.00000000
##   0.00000000
##   0.00000000
##   0.03271344
##   0.19769242
##   0.32838124
##   0.00000000
##   0.00000000
##   0.00000000
##   0.01979491
##   0.42653711
##   0.45011170
##   0.00000000
##   0.00000000
##   0.00000000
##   0.02003560
##   0.42854145
##   0.44991999
##   0.00000000
##   0.00000000
##   0.00000000
##   0.02081668
##   0.42706174
##   0.45227308
##   0.00000000
##   0.00000000
##   0.00000000
##   0.12483824
##   0.45763732
##   0.47169234
##   0.00000000
##   0.00000000
##   0.00000000
##   0.11868765
##   0.45869706
##   0.47305723
##   0.00000000
##   0.00000000
##   0.00000000
##   0.12266339
##   0.45999223
##   0.47374887
##   0.00000000
##   0.03655032
##   0.09264904
##   0.21205073
##   0.40294661
##   0.41960354
##   0.00000000
##   0.03578969
##   0.09164590
##   0.22013113
##   0.40225742
##   0.42208314
##   0.00000000
##   0.03729892
##   0.09105956
##   0.20667468
##   0.40182904
##   0.42004092
##   0.00000000
##   0.04801892
##   0.37246906
##   0.42804343
##   0.46474308
##   0.46687662
##   0.00000000
##   0.04640580
##   0.37491578
##   0.42611118
##   0.46715974
##   0.47151185
##   0.00000000
##   0.03792738
##   0.37450261
##   0.42527966
##   0.46445271
##   0.46730406
##   0.00000000
##   0.14674980
##   0.41009792
##   0.45481343
##   0.47790108
##   0.47523556
##   0.00000000
##   0.16658088
##   0.40525578
##   0.45726659
##   0.47661506
##   0.47428494
##   0.00000000
##   0.15373679
##   0.40903277
##   0.45658443
##   0.47716293
##   0.47609486
##   0.04317232
##   0.07941222
##   0.22375317
##   0.33729930
##   0.42025078
##   0.42356660
##   0.03836378
##   0.07759281
##   0.23054029
##   0.33699293
##   0.42013753
##   0.42609111
##   0.04215116
##   0.07858481
##   0.22110731
##   0.33530467
##   0.41907190
##   0.42396224
##   0.08890398
##   0.32503865
##   0.42544827
##   0.44885653
##   0.46799809
##   0.46960602
##   0.07482536
##   0.32493037
##   0.42734017
##   0.45188310
##   0.46901781
##   0.46754899
##   0.08443919
##   0.31742895
##   0.42815160
##   0.44947247
##   0.46691791
##   0.46932781
##   0.18377223
##   0.38021718
##   0.45552578
##   0.47022117
##   0.47140424
##   0.46760740
##   0.18973125
##   0.38175534
##   0.45623593
##   0.47034148
##   0.47194481
##   0.46867213
##   0.18571495
##   0.37819253
##   0.45303490
##   0.46951103
##   0.47137584
##   0.46752728
##   0.31667335
##   0.38431114
##   0.40855693
##   0.41825851
##   0.42669211
##   0.42903329
##   0.31258808
##   0.38136717
##   0.40627719
##   0.42026914
##   0.42758466
##   0.42684539
##   0.31990239
##   0.38729044
##   0.40874817
##   0.41875263
##   0.42538371
##   0.43090132
##   0.42062776
##   0.43726032
##   0.44544334
##   0.44900952
##   0.44070864
##   0.43705075
##   0.41968157
##   0.43907408
##   0.44614069
##   0.44614137
##   0.44286235
##   0.43510905
##   0.42667586
##   0.43701145
##   0.44596172
##   0.45328366
##   0.44880505
##   0.44307079
##   0.44215422
##   0.44939603
##   0.44175308
##   0.44406976
##   0.43460273
##   0.43098439
##   0.44498464
##   0.44639614
##   0.45068842
##   0.44997511
##   0.43465401
##   0.43020531
##   0.43897011
##   0.44408071
##   0.44870868
##   0.44822852
##   0.43872644
##   0.43528820
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 1000,
##  interaction.depth = 10, shrinkage = 0.01 and n.minobsinnode = 10.
model_gbm$bestTune%>%kable()
|   | n.trees| interaction.depth| shrinkage| n.minobsinnode|
|:--|-------:|-----------------:|---------:|--------------:|
|54 |    1000|                10|      0.01|             10|
model_gbm$results%>%head()%>%kable()
|   | shrinkage| interaction.depth| n.minobsinnode| n.trees|  Accuracy| Kappa| AccuracySD| KappaSD|
|:--|---------:|-----------------:|--------------:|-------:|---------:|-----:|----------:|-------:|
|1  |      0.01|                 1|              3|      10| 0.8829979|     0|   7.02e-05|       0|
|7  |      0.01|                 1|              5|      10| 0.8829979|     0|   7.02e-05|       0|
|13 |      0.01|                 1|             10|      10| 0.8829979|     0|   7.02e-05|       0|
|55 |      0.05|                 1|              3|      10| 0.8829979|     0|   7.02e-05|       0|
|61 |      0.05|                 1|              5|      10| 0.8829979|     0|   7.02e-05|       0|
|67 |      0.05|                 1|             10|      10| 0.8829979|     0|   7.02e-05|       0|
#various ways of finding the row with maximum accuracy

model_gbm$results[which.max(model_gbm$results$Accuracy),]
model_gbm$results%>%dplyr::summarise(max1=max(Accuracy))
model_gbm$results%>%dplyr::slice(which.max(Accuracy))
model_gbm$results[which(model_gbm$results$Accuracy==max(model_gbm$results$Accuracy)),]
plot(model_gbm)


#trellis.par.set(caretTheme())
#plot(model_nnet,metric = "kappa",plotType = "level",scales=list(x=list(rot=90)))

#other model plot
trellis.par.set(caretTheme())

plot(model_gbm,plotType = "level",scales=list(x=list(rot=90)))

#using tune length

#fitControl <- trainControl(method = "repeatedcv",number = 5,repeats = 5,
 #                          allowParallel = TRUE )


# Set up training control
fitControl <- trainControl(method = "repeatedcv",      # repeated cross-validation
                     number = 5,                       # 5 folds (repeats defaults to 1)
                     summaryFunction=twoClassSummary,  # compute ROC/Sens/Spec and pick the best model by AUC
                     classProbs=TRUE,                  # class probabilities are required by twoClassSummary
                     allowParallel = TRUE)
 
#This fixes problem with names when classProbs=TRUE
trainSetglm<-if_else(trainSet[,outcomeName]==0,"No","Yes")

#model_gbm2<-train(trainSet[,Topfivepred],trainSetglm,method='gbm',
#               preProcess = c("center","scale"),   trControl=fitControl,metric = "ROC")

We can use tuneLength instead of specifying each parameter value explicitly: tuneLength tells caret how many candidate values to try for each tuning parameter, and caret generates the grid itself.
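
A hedged sketch of such a call (kept commented like the other heavy fits; model_gbm_tl is a hypothetical name):

#model_gbm_tl<-train(trainSet[,Topfivepred],trainSetglm,method='gbm',
#                    trControl=fitControl,metric="ROC",tuneLength=5)  # 5 candidate values per parameter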

# model_gbm3<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method='gbm',
#                   trControl=fitControl,interaction.depth=10,n.trees=100,n.minobsinnode=10)
# print(model_gbm)
#save(model_gbm2,file="model_gbm2.RData")
 
load("model_gbm2.RData")

print(model_gbm2)
## Stochastic Gradient Boosting 
## 
## 31649 samples
##     5 predictor
##     2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (5), scaled (5) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 25320, 25319, 25320, 25318, 25319 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  ROC        Sens       Spec     
##   1                   50      0.8769229  0.9891934  0.1909235
##   1                  100      0.8839802  0.9765976  0.3197414
##   1                  150      0.8861147  0.9744507  0.3418784
##   2                   50      0.8844877  0.9733413  0.3440376
##   2                  100      0.8875928  0.9709081  0.3734716
##   2                  150      0.8881013  0.9708724  0.3729350
##   3                   50      0.8868124  0.9721248  0.3618594
##   3                  100      0.8882853  0.9716596  0.3651027
##   3                  150      0.8884538  0.9723394  0.3578156
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
#plot the model metrics
trellis.par.set(caretTheme())

plot(model_gbm2)

#varImp(object=model_gbm2)


# Predict using the test data
pred<-predict(model_gbm2,testSet)

my_data=data.frame(cbind(predicted=pred,observed=testSet$y))

ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Stochastic Gradient Boosting Machine')

# Print, plot variable importance
#print(varImp(model_gbm2, scale = FALSE))

#plot(varImp(model_gbm2, scale = FALSE), main="Stochastic Gradient Boosting")



#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")

#note: caret::confusionMatrix(data, reference) expects the predictions first;
#here the observed labels are passed first, so the table below is transposed
#relative to the usual convention
confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  11971    16
##        Yes  1565    11
##                                           
##                Accuracy : 0.8834          
##                  95% CI : (0.8779, 0.8888)
##     No Information Rate : 0.998           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0098          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.88438         
##             Specificity : 0.40741         
##          Pos Pred Value : 0.99867         
##          Neg Pred Value : 0.00698         
##              Prevalence : 0.99801         
##          Detection Rate : 0.88262         
##    Detection Prevalence : 0.88380         
##       Balanced Accuracy : 0.64589         
##                                           
##        'Positive' Class : No              
## 
#other model plot
trellis.par.set(caretTheme())
#plot(model_gbm2,metric = "kappa",plotType = "level",scales=list(x=list(rot=90)))


ggplot(model_gbm2)
## Warning: Ignoring unknown aesthetics: shape

#other model plot
trellis.par.set(caretTheme())

plot(model_gbm2,plotType = "level",scales=list(x=list(rot=90)))

The maximum accuracy of 0.9059 occurs at the parameter combination shrinkage=0.01, interaction.depth=10, n.minobsinnode=10 and n.trees=1000. A new model can be fitted with these parameter values.
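
A sketch of that refit with a single-row tuneGrid (kept commented like the other saved fits; best_grid and model_gbm_best are hypothetical names, and predictors is the predictor vector used earlier):

#best_grid<-expand.grid(n.trees=1000,interaction.depth=10,
#                       shrinkage=0.01,n.minobsinnode=10)
#model_gbm_best<-train(trainSet[,predictors],trainSet[,outcomeName],method='gbm',
#                      trControl=fitControl,tuneGrid=best_grid)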

Neural networks

#This fixes problem with names when classProbs=TRUE
trainSetglm<-if_else(trainSet[,outcomeName]==0,"No","Yes")

#model_nnet<-train(trainSet[,Topfivepred],trainSetglm,method="nnet",trControl=fitControl,
#                  preProcess = c("center","scale"),metric = "ROC")
#save(model_nnet,file="model_nnet.RData")
 
load("model_nnet.RData")

print(model_nnet)
## Neural Network 
## 
## 31649 samples
##     5 predictor
##     2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (5), scaled (5) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 25319, 25319, 25320, 25320, 25318 
## Resampling results across tuning parameters:
## 
##   size  decay  ROC        Sens       Spec     
##   1     0e+00  0.8851818  0.9647894  0.3918467
##   1     1e-04  0.8855445  0.9655766  0.3915764
##   1     1e-01  0.8864474  0.9663996  0.3980563
##   3     0e+00  0.8881036  0.9695129  0.3707714
##   3     1e-04  0.8890963  0.9716239  0.3586264
##   3     1e-01  0.8896989  0.9737709  0.3535069
##   5     0e+00  0.8888763  0.9731625  0.3591753
##   5     1e-04  0.8899594  0.9738068  0.3586202
##   5     1e-01  0.8901357  0.9744151  0.3540398
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
ggplot(model_nnet)
## Warning: Ignoring unknown aesthetics: shape

#other model plot
trellis.par.set(caretTheme())

plot(model_nnet,plotType = "level",scales=list(x=list(rot=90)))

varImp(object=model_nnet)
## nnet variable importance
## 
##                 Overall
## duration        100.000
## poutcomesuccess  35.001
## contactunknown   34.089
## housingyes        1.523
## monthmar          0.000
# Predict using the test data
pred<-predict(model_nnet,testSet)

my_data=data.frame(cbind(predicted=pred,observed=testSet$y))

ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Neural Network')

# Print, plot variable importance
print(varImp(model_nnet, scale = FALSE))
## nnet variable importance
## 
##                 Overall
## duration         41.478
## poutcomesuccess  20.286
## contactunknown   19.989
## housingyes        9.371
## monthmar          8.875
plot(varImp(model_nnet, scale = FALSE), main="Neural Network")

#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")

confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  11694   293
##        Yes  1018   558
##                                           
##                Accuracy : 0.9033          
##                  95% CI : (0.8982, 0.9083)
##     No Information Rate : 0.9373          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4119          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9199          
##             Specificity : 0.6557          
##          Pos Pred Value : 0.9756          
##          Neg Pred Value : 0.3541          
##              Prevalence : 0.9373          
##          Detection Rate : 0.8622          
##    Detection Prevalence : 0.8838          
##       Balanced Accuracy : 0.7878          
##                                           
##        'Positive' Class : No              
## 

Choosing Between Models

We choose between the models using metrics such as the area under the ROC curve, which is very popular for binary classification problems, together with sensitivity and specificity. Accuracy is another way to pick the best model, although for binary classification problems, especially with imbalanced classes like these, ROC is the better criterion.



tibble(model=c("Stochastic Gradient Boosting Machine","Random Forest","Boosted Logistic Regression","Extreme Gradient Boosting Machine","Logistic Regression","Neural Networks"),
       accuracy=c(0.8832,0.9174,0.8416,0.8832,0.901,0.9038))
######################################################################################

# CHOOSING BETWEEN MODELS

######################################################################################



resamps <- resamples(list(GBM = model_gbm2,
                  #SVM = model_svm,
                          RF=re[[3]],
                          BLR=re[[1]],
                          EGBM=re[[2]],
                          GLM=model_glm,
                          NNET = model_nnet))
resamps
## 
## Call:
## resamples.default(x = list(GBM = model_gbm2, RF = re[[3]], BLR =
##  re[[1]], EGBM = re[[2]], GLM = model_glm, NNET = model_nnet))
## 
## Models: GBM, RF, BLR, EGBM, GLM, NNET 
## Number of resamples: 5 
## Performance metrics: ROC, Sens, Spec 
## Time estimates for: everything, final model fit
summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: GBM, RF, BLR, EGBM, GLM, NNET 
## Number of resamples: 5 
## 
## ROC 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## GBM  0.8816546 0.8833298 0.8903776 0.8884538 0.8929705 0.8939366    0
## RF   0.8035906 0.8177582 0.8195248 0.8175060 0.8210548 0.8256018    0
## BLR  0.8319502 0.8358015 0.8450984 0.8439240 0.8510384 0.8557315    0
## EGBM 0.8652316 0.8754254 0.8759783 0.8788672 0.8844292 0.8932713    0
## GLM  0.8796846 0.8821885 0.8827486 0.8838859 0.8850330 0.8897746    0
## NNET 0.8793133 0.8898545 0.8919580 0.8901357 0.8943596 0.8951932    0
## 
## Sens 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## GBM  0.9699410 0.9699410 0.9728037 0.9723394 0.9744140 0.9745975    0
## RF   0.9559850 0.9572374 0.9586688 0.9582409 0.9590340 0.9602791    0
## BLR  0.9296833 0.9593845 0.9649311 0.9619261 0.9762075 0.9794239    0
## EGBM 0.9728037 0.9749508 0.9762033 0.9762040 0.9767400 0.9803220    0
## GLM  0.9735194 0.9736983 0.9745975 0.9762757 0.9792449 0.9803185    0
## NNET 0.9704777 0.9711986 0.9749508 0.9744151 0.9769189 0.9785293    0
## 
## Spec 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## GBM  0.3454791 0.3481781 0.3527027 0.3578156 0.3540541 0.3886640    0
## RF   0.3198381 0.3216216 0.3229730 0.3291888 0.3292848 0.3522267    0
## BLR  0.2780027 0.3009447 0.3373819 0.3665091 0.3972973 0.5189189    0
## EGBM 0.3090418 0.3162162 0.3270270 0.3324273 0.3360324 0.3738192    0
## GLM  0.2995951 0.3067568 0.3162162 0.3229752 0.3360324 0.3562753    0
## NNET 0.3306343 0.3576248 0.3581081 0.3540398 0.3594595 0.3643725    0
theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch = 16
theme1$plot.line$col = rgb(1, 0, 0, .7)
theme1$plot.line$lwd <- 2
trellis.par.set(theme1)
bwplot(resamps, layout = c(3, 1))

The neural network (size = 5, i.e. five hidden units) provides the highest ROC (area under the receiver operating characteristic curve). The closer this value is to 1, the better the model. It also has one of the highest accuracy measures of the models trained.

trellis.par.set(caretTheme())
dotplot(resamps, metric = "ROC")

trellis.par.set(theme1)
xyplot(resamps, what = "BlandAltman")

splom(resamps)

######################################################################################

# CHOOSING BETWEEN MODELS

######################################################################################



difValues <- diff(resamps)
difValues
## 
## Call:
## diff.resamples(x = resamps)
## 
## Models: GBM, RF, BLR, EGBM, GLM, NNET 
## Metrics: ROC, Sens, Spec 
## Number of differences: 15 
## p-value adjustment: bonferroni
summary(difValues)
## 
## Call:
## summary.diff.resamples(object = difValues)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## ROC 
##      GBM       RF        BLR       EGBM      GLM       NNET     
## GBM             0.070948  0.044530  0.009587  0.004568 -0.001682
## RF   0.0022120           -0.026418 -0.061361 -0.066380 -0.072630
## BLR  0.0135805 0.1139522           -0.034943 -0.039962 -0.046212
## EGBM 1.0000000 0.0079713 0.2775045           -0.005019 -0.011269
## GLM  1.0000000 0.0003999 0.0294592 1.0000000           -0.006250
## NNET 1.0000000 1.48e-05  0.0131094 1.0000000 1.0000000          
## 
## Sens 
##      GBM     RF         BLR        EGBM       GLM        NNET      
## GBM           1.410e-02  1.041e-02 -3.865e-03 -3.936e-03 -2.076e-03
## RF   0.01337            -3.685e-03 -1.796e-02 -1.803e-02 -1.617e-02
## BLR  1.00000 1.00000               -1.428e-02 -1.435e-02 -1.249e-02
## EGBM 0.76082 0.00154    1.00000               -7.177e-05  1.789e-03
## GLM  1.00000 0.01147    1.00000    1.00000                1.861e-03
## NNET 1.00000 0.03067    1.00000    1.00000    1.00000              
## 
## Spec 
##      GBM     RF        BLR       EGBM      GLM       NNET     
## GBM           0.028627 -0.008694  0.025388  0.034840  0.003776
## RF   1.00000           -0.037320 -0.003239  0.006214 -0.024851
## BLR  1.00000 1.00000              0.034082  0.043534  0.012469
## EGBM 1.00000 1.00000   1.00000              0.009452 -0.021613
## GLM  0.07509 1.00000   1.00000   1.00000             -0.031065
## NNET 1.00000 0.65561   1.00000   1.00000   1.00000
trellis.par.set(theme1)
bwplot(difValues, layout = c(3, 1))

trellis.par.set(caretTheme())
dotplot(difValues)
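
The paired comparisons back up the resampling plots: the random forest's ROC is significantly lower than that of the GBM and the neural network (Bonferroni-adjusted p-values of 0.0022 and 1.48e-05 respectively), and the boosted logistic regression is also significantly below the GBM, GLM and neural network, while the ROC differences among GBM, EGBM, GLM and NNET are not statistically significant (adjusted p = 1).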

Predicted Probabilities and ROC Curve

gbm.probs<-predict(model_gbm2,testSet,type="prob")

#svm.probs<-predict.train(object=model_svm,testSet[,Topfivepred],type="prob")


blr.probs<-predict(re[[1]],testSet,type="prob")

rf.probs<-predict(re[[3]],testSet,type="prob")

egbm.probs<-predict(re[[2]],testSet,type="prob")

glm.probs<-predict(model_glm,testSet,type="prob")

nnet.probs<-predict(model_nnet,testSet,type="prob")

#svm.probs returning NA

#svm.probs%>%head()

blr.probs%>%head()
#This fixes problem with names when classProbs=TRUE
#testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")

#Data$y<-ifelse(Data$y=='no',0,1)

results=data.frame(label=testSetglm)

results$gbm<-gbm.probs[,"Yes"]

#results$svm<-svm.probs[,"Yes"]


results$blr<-blr.probs[,"Yes"]

results$rf<-rf.probs[,"Yes"]

results$egbm<-egbm.probs[,"Yes"]

results$glm<-glm.probs[,"Yes"]

results$nnet<-nnet.probs[,"Yes"]

results%>%head()
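
The section heading promises an ROC curve; with the probabilities collected in results, one can be drawn directly with the already-loaded pROC package. A minimal sketch for the neural network probabilities (roc_nnet is a hypothetical name):

#test-set ROC curve for the neural network
roc_nnet <- pROC::roc(response = results$label, predictor = results$nnet,
                      levels = c("No", "Yes"))
plot(roc_nnet, main = "Neural Network test-set ROC")
pROC::auc(roc_nnet)
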
Lift Curves

The lift function can be used to evaluate probability thresholds that capture a certain percentage of hits. The function requires a set of sample probability predictions (not from the training set) and the true class labels. The lift function does the calculations, and the corresponding plot method draws the lift curve (which some call the gain curve).

trellis.par.set(caretTheme())
                                          

#lift_obj <- caret::lift(label ~ gbm  + blr+rf+egbm+glm+nnet, data = results)
#save(lift_obj,file="lift_obj.RData")
 
load("lift_obj.RData")
ggplot(lift_obj, values = 60)

From this we can see that, to capture 60 percent of the hits, approximately 62 percent of the data would need to be sampled (when ordered by the probability predictions). The boosted logistic regression model does somewhat worse than the other five models.

plot(lift_obj, values = 60, auto.key = list(columns = 3,
                                            lines = TRUE,
                                            points = FALSE))

Calibration Curves

Calibration curves can be used to characterize how consistent the predicted class probabilities are with the observed event rates.

trellis.par.set(caretTheme())

cal_obj <- caret::calibration(label ~ gbm  + blr+rf+egbm+glm+nnet, data = results,
                       cuts = 13)

plot(cal_obj, type = "l", auto.key = list(columns = 3,
                                          lines = TRUE,
                                          points = FALSE))

ggplot(cal_obj)

Using the lime package to visualize predictors

The lime package provides weighted linear combinations of predictors that help explain the predictions of the fitted model.

# Create an explainer object
#explainer <- lime(trainSet, model_nnet)

# Explain new observation
#explanation <- explain(testSet, explainer, n_labels = 1, n_features = 5)

# The output is provided in a consistent tabular format and includes the
# output from the model.
#head(explanation)

# And can be visualised directly
#plot_features(explanation)
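
If run, plot_features() draws, for each explained case, the features that locally support or contradict the predicted label, together with their weights.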
stopImplicitCluster()