The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
The data is available at the UCI Machine Learning Repository here.
Citation Request: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.
The data comes from direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls; often more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed.
The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
Number of Instances: 45211 for bank-full.csv (4521 for bank.csv)
Number of Attributes: 16 + output attribute.
Attribute information:
**Input variables:**
**Bank client data:**
**Summary:** The most important predictors of term deposit subscription are a successful outcome of the previous marketing campaign, the duration of the last contact, whether the person holds a housing loan, and whether the last contact month was March. In general, a longer last contact duration, a successful outcome of the previous marketing campaign, and contacting people in March all increase the odds that a person will subscribe to a term deposit. On the other hand, people with an unknown contact type or a housing loan are less likely to subscribe.
**Related to the last contact of the current campaign:**
Output variable (desired target): 17. y - has the client subscribed to a term deposit? (binary: "yes", "no")
pacman::p_load("lubridate", "dplyr", "magrittr")
library(rio)
library(doParallel)
library(viridis)
library(RColorBrewer)
library(tidyverse)
library(ggthemes)
library(knitr)
library(caret)
library(caretEnsemble)
library(plotly)
library(lime)
library(plotROC)
library(pROC)
require(xgboost)
require(Matrix)
# Calculate the number of cores
no_cores <- detectCores() - 1
cl<-makeCluster(no_cores)
registerDoParallel(cl)
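The cluster set up above should be released once all model training is finished; a minimal reminder, left commented out here so the workers stay available for the models below:
# release the parallel backend and return to sequential execution
# stopCluster(cl)
# registerDoSEQ()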
setwd("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience")
#load the csv file with rio
Data<- rio::import("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/Anomalydetection/bank/bank-full.csv")
Data%>%head
#%>%kable()
summary(Data)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## y
## Length:45211
## Class :character
## Mode :character
##
##
##
There are no missing observations in the data.
#==================================================================
#check the number of missing rows
#==================================================================
colSums(is.na.data.frame(Data))
## age job marital education default balance housing
## 0 0 0 0 0 0 0
## loan contact day month duration campaign pdays
## 0 0 0 0 0 0 0
## previous poutcome y
## 0 0 0
#==================================================================
# descriptive/summary statistics
#==================================================================
Hmisc::describe.data.frame(Data)
## Data
##
## 17 Variables 45211 Observations
## ---------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 45211 0 77 0.999 40.94 11.87 27 29
## .25 .50 .75 .90 .95
## 33 39 48 56 59
##
## lowest : 18 19 20 21 22, highest: 90 92 93 94 95
## ---------------------------------------------------------------------------
## job
## n missing distinct
## 45211 0 12
##
## admin. (5171, 0.114), blue-collar (9732, 0.215), entrepreneur (1487,
## 0.033), housemaid (1240, 0.027), management (9458, 0.209), retired (2264,
## 0.050), self-employed (1579, 0.035), services (4154, 0.092), student (938,
## 0.021), technician (7597, 0.168), unemployed (1303, 0.029), unknown (288,
## 0.006)
## ---------------------------------------------------------------------------
## marital
## n missing distinct
## 45211 0 3
##
## Value divorced married single
## Frequency 5207 27214 12790
## Proportion 0.115 0.602 0.283
## ---------------------------------------------------------------------------
## education
## n missing distinct
## 45211 0 4
##
## Value primary secondary tertiary unknown
## Frequency 6851 23202 13301 1857
## Proportion 0.152 0.513 0.294 0.041
## ---------------------------------------------------------------------------
## default
## n missing distinct
## 45211 0 2
##
## Value no yes
## Frequency 44396 815
## Proportion 0.982 0.018
## ---------------------------------------------------------------------------
## balance
## n missing distinct Info Mean Gmd .05 .10
## 45211 0 7168 1 1362 2054 -172 0
## .25 .50 .75 .90 .95
## 72 448 1428 3574 5768
##
## lowest : -8019 -6847 -4057 -3372 -3313, highest: 66721 71188 81204 98417 102127
## ---------------------------------------------------------------------------
## housing
## n missing distinct
## 45211 0 2
##
## Value no yes
## Frequency 20081 25130
## Proportion 0.444 0.556
## ---------------------------------------------------------------------------
## loan
## n missing distinct
## 45211 0 2
##
## Value no yes
## Frequency 37967 7244
## Proportion 0.84 0.16
## ---------------------------------------------------------------------------
## contact
## n missing distinct
## 45211 0 3
##
## Value cellular telephone unknown
## Frequency 29285 2906 13020
## Proportion 0.648 0.064 0.288
## ---------------------------------------------------------------------------
## day
## n missing distinct Info Mean Gmd .05 .10
## 45211 0 31 0.999 15.81 9.576 3 5
## .25 .50 .75 .90 .95
## 8 16 21 28 29
##
## lowest : 1 2 3 4 5, highest: 27 28 29 30 31
## ---------------------------------------------------------------------------
## month
## n missing distinct
## 45211 0 12
##
## Value apr aug dec feb jan jul jun mar may nov
## Frequency 2932 6247 214 2649 1403 6895 5341 477 13766 3970
## Proportion 0.065 0.138 0.005 0.059 0.031 0.153 0.118 0.011 0.304 0.088
##
## Value oct sep
## Frequency 738 579
## Proportion 0.016 0.013
## ---------------------------------------------------------------------------
## duration
## n missing distinct Info Mean Gmd .05 .10
## 45211 0 1573 1 258.2 235.4 35 58
## .25 .50 .75 .90 .95
## 103 180 319 548 751
##
## lowest : 0 1 2 3 4, highest: 3366 3422 3785 3881 4918
## ---------------------------------------------------------------------------
## campaign
## n missing distinct Info Mean Gmd .05 .10
## 45211 0 48 0.918 2.764 2.383 1 1
## .25 .50 .75 .90 .95
## 1 2 3 5 8
##
## lowest : 1 2 3 4 5, highest: 50 51 55 58 63
## ---------------------------------------------------------------------------
## pdays
## n missing distinct Info Mean Gmd .05 .10
## 45211 0 559 0.454 40.2 71.61 -1 -1
## .25 .50 .75 .90 .95
## -1 -1 -1 185 317
##
## lowest : -1 1 2 3 4, highest: 838 842 850 854 871
## ---------------------------------------------------------------------------
## previous
## n missing distinct Info Mean Gmd .05 .10
## 45211 0 41 0.454 0.5803 1.044 0 0
## .25 .50 .75 .90 .95
## 0 0 0 2 3
##
## lowest : 0 1 2 3 4, highest: 41 51 55 58 275
## ---------------------------------------------------------------------------
## poutcome
## n missing distinct
## 45211 0 4
##
## Value failure other success unknown
## Frequency 4901 1840 1511 36959
## Proportion 0.108 0.041 0.033 0.817
## ---------------------------------------------------------------------------
## y
## n missing distinct
## 45211 0 2
##
## Value no yes
## Frequency 39922 5289
## Proportion 0.883 0.117
## ---------------------------------------------------------------------------
#describe(Data)
#==================================================================
# Histograms
#==================================================================
theme_set(theme_economist_white())
#ggplot(Data) + geom_boxplot(aes(x =age,y=duration,color=y))
ggplot(Data, aes(x ="",y=age, fill=y))+ geom_boxplot()+labs(x="age",y="")
#ggplotly(p)
ggplot(Data, aes(x =duration, fill=y))+ geom_histogram(bins = 30)
#ggplotly()
ggplot(Data, aes(x =age, fill=y))+ geom_histogram(bins = 30)
# ggplotly()
ggplot(Data, aes(x =day, fill=y))+ geom_histogram(bins = 30)
#ggplotly()
ggplot(Data, aes(x =balance, fill=y))+ geom_histogram(bins = 30)
#ggplotly()
ggplot(Data, aes(x =age, fill=y))+ geom_histogram(bins = 30)
# ggplotly()
# geom_density(alpha=1/3,color="red") + scale_fill_hue()
ggplot(Data, aes(x=age, fill=y)) + geom_density(alpha=1/3) + scale_fill_hue()
ggplot(Data, aes(x=marital,previous)) +geom_violin()
ggplot(Data, aes(x=education,previous)) +geom_violin()
#ggplot(Data, aes(x="",y =loan, fill=y)) + geom_histogram()
ggplot(Data, aes(x ="",y=age, fill=y))+ geom_violin(adjust = .5,draw_quantiles = c(0.25, 0.5, 0.75))+labs(x="age",y="")
ggplotly()
Convert character variables to factor variables. This is necessary for the caret package to train the models we are interested in later.
theme_set(theme_bw())
Data<-Data %>% mutate_if(is.character, as.factor)
str(Data)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
ggplot(Data, aes(x = y)) + geom_bar() + labs(x="Term Deposit")
#==================================================================
#Converting outcome variable to numeric
#==================================================================
Data$y<-ifelse(Data$y=='no',0,1)
str(Data)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : num 0 0 0 0 0 0 0 0 0 0 ...
glimpse(Data)
## Observations: 45,211
## Variables: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, ...
## $ job <fctr> management, technician, entrepreneur, blue-collar, ...
## $ marital <fctr> married, single, married, married, single, married,...
## $ education <fctr> tertiary, secondary, secondary, unknown, unknown, t...
## $ default <fctr> no, no, no, no, no, no, no, yes, no, no, no, no, no...
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 39...
## $ housing <fctr> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, ye...
## $ loan <fctr> no, no, yes, no, no, no, yes, no, no, no, no, no, n...
## $ contact <fctr> unknown, unknown, unknown, unknown, unknown, unknow...
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ month <fctr> may, may, may, may, may, may, may, may, may, may, m...
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 1...
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ poutcome <fctr> unknown, unknown, unknown, unknown, unknown, unknow...
## $ y <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
Convert categorical variables to dummy variables using either model.matrix or sparse.model.matrix. Adding \(-1\) to the formula removes the intercept column, which is all ones (the chunks below keep the intercept).
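A minimal sketch of the \(-1\) variant (illustrative only, assuming Data is the prepared data frame; the chunks below keep the intercept instead):
# Hedged illustration: drop the intercept column with -1 in the formula.
# With no intercept, the first factor is expanded into all of its levels.
d_noint <- model.matrix(y ~ . - 1, data = Data)
head(colnames(d_noint))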
# predictor names are all columns except the outcome y
predictors<-setdiff(names(Data),"y")
# equivalently: predictors<-names(Data[,-17])
#paste0(predictors,sep="",collapse = "+ ")
#paste("~",paste0(predictors,sep="",collapse = "+ "))
as.formula(paste("y~",paste0(predictors,sep="",collapse = "+ ")))
## y ~ age + job + marital + education + default + balance + housing +
## loan + contact + day + month + duration + campaign + pdays +
## previous + poutcome
d1<-model.matrix(as.formula(paste("y~",paste0(predictors,sep="",collapse = "+ "))), Data)
d1b<-Matrix::sparse.model.matrix(as.formula(paste("~",paste0(predictors,sep="",collapse = "+ "))), Data)
#d1<-model.matrix(~age+job+marital+education + default + balance + housing +
# loan + contact + day + month + duration + campaign + pdays +
# previous + poutcome,Data)
#d1b<-Matrix::sparse.model.matrix(~age+job+marital+education + default + balance + housing +
# loan + contact + day + month + duration + campaign + pdays +
# previous + poutcome,Data)
head(d1)
## (Intercept) age jobblue-collar jobentrepreneur jobhousemaid
## 1 1 58 0 0 0
## 2 1 44 0 0 0
## 3 1 33 0 1 0
## 4 1 47 1 0 0
## 5 1 33 0 0 0
## 6 1 35 0 0 0
## jobmanagement jobretired jobself-employed jobservices jobstudent
## 1 1 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 1 0 0 0 0
## jobtechnician jobunemployed jobunknown maritalmarried maritalsingle
## 1 0 0 0 1 0
## 2 1 0 0 0 1
## 3 0 0 0 1 0
## 4 0 0 0 1 0
## 5 0 0 1 0 1
## 6 0 0 0 1 0
## educationsecondary educationtertiary educationunknown defaultyes balance
## 1 0 1 0 0 2143
## 2 1 0 0 0 29
## 3 1 0 0 0 2
## 4 0 0 1 0 1506
## 5 0 0 1 0 1
## 6 0 1 0 0 231
## housingyes loanyes contacttelephone contactunknown day monthaug monthdec
## 1 1 0 0 1 5 0 0
## 2 1 0 0 1 5 0 0
## 3 1 1 0 1 5 0 0
## 4 1 0 0 1 5 0 0
## 5 0 0 0 1 5 0 0
## 6 1 0 0 1 5 0 0
## monthfeb monthjan monthjul monthjun monthmar monthmay monthnov monthoct
## 1 0 0 0 0 0 1 0 0
## 2 0 0 0 0 0 1 0 0
## 3 0 0 0 0 0 1 0 0
## 4 0 0 0 0 0 1 0 0
## 5 0 0 0 0 0 1 0 0
## 6 0 0 0 0 0 1 0 0
## monthsep duration campaign pdays previous poutcomeother poutcomesuccess
## 1 0 261 1 -1 0 0 0
## 2 0 151 1 -1 0 0 0
## 3 0 76 1 -1 0 0 0
## 4 0 92 1 -1 0 0 0
## 5 0 198 1 -1 0 0 0
## 6 0 139 1 -1 0 0 0
## poutcomeunknown
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
head(d1b)
## 6 x 43 sparse Matrix of class "dgCMatrix"
## [[ suppressing 43 column names '(Intercept)', 'age', 'jobblue-collar' ... ]]
##
## 1 1 58 . . . 1 . . . . . . . 1 . . 1 . . 2143 1 . . 1 5 . . . . . . . 1 .
## 2 1 44 . . . . . . . . 1 . . . 1 1 . . . 29 1 . . 1 5 . . . . . . . 1 .
## 3 1 33 . 1 . . . . . . . . . 1 . 1 . . . 2 1 1 . 1 5 . . . . . . . 1 .
## 4 1 47 1 . . . . . . . . . . 1 . . . 1 . 1506 1 . . 1 5 . . . . . . . 1 .
## 5 1 33 . . . . . . . . . . 1 . 1 . . 1 . 1 . . . 1 5 . . . . . . . 1 .
## 6 1 35 . . . 1 . . . . . . . 1 . . 1 . . 231 1 . . 1 5 . . . . . . . 1 .
##
## 1 . . 261 1 -1 . . . 1
## 2 . . 151 1 -1 . . . 1
## 3 . . 76 1 -1 . . . 1
## 4 . . 92 1 -1 . . . 1
## 5 . . 198 1 -1 . . . 1
## 6 . . 139 1 -1 . . . 1
In the next step we transform the categorical variables into dummy variables. This is the one-hot encoding step: each level of each categorical feature becomes a binary feature in {0, 1}. The dummy conversion results in 42 predictor variables.
#==================================================================
#convert categorical variables to dummy (0/1) variables with caret::dummyVars
#==================================================================
dmy <- dummyVars(" ~ .", data = Data,fullRank = T)
transformed <- data.frame(predict(dmy, newdata =Data))
#Checking the structure of transformed train file
str(transformed)
## 'data.frame': 45211 obs. of 43 variables:
## $ age : num 58 44 33 47 33 35 28 42 58 43 ...
## $ job.blue.collar : num 0 0 0 1 0 0 0 0 0 0 ...
## $ job.entrepreneur : num 0 0 1 0 0 0 0 1 0 0 ...
## $ job.housemaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job.management : num 1 0 0 0 0 1 1 0 0 0 ...
## $ job.retired : num 0 0 0 0 0 0 0 0 1 0 ...
## $ job.self.employed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job.services : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job.student : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job.technician : num 0 1 0 0 0 0 0 0 0 1 ...
## $ job.unemployed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job.unknown : num 0 0 0 0 1 0 0 0 0 0 ...
## $ marital.married : num 1 0 1 1 0 1 0 0 1 0 ...
## $ marital.single : num 0 1 0 0 1 0 1 0 0 1 ...
## $ education.secondary: num 0 1 1 0 0 0 0 0 0 1 ...
## $ education.tertiary : num 1 0 0 0 0 1 1 1 0 0 ...
## $ education.unknown : num 0 0 0 1 1 0 0 0 0 0 ...
## $ default.yes : num 0 0 0 0 0 0 0 1 0 0 ...
## $ balance : num 2143 29 2 1506 1 ...
## $ housing.yes : num 1 1 1 1 0 1 1 1 1 1 ...
## $ loan.yes : num 0 0 1 0 0 0 1 0 0 0 ...
## $ contact.telephone : num 0 0 0 0 0 0 0 0 0 0 ...
## $ contact.unknown : num 1 1 1 1 1 1 1 1 1 1 ...
## $ day : num 5 5 5 5 5 5 5 5 5 5 ...
## $ month.aug : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.dec : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.feb : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.jan : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.jul : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.jun : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.mar : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.may : num 1 1 1 1 1 1 1 1 1 1 ...
## $ month.nov : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.oct : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.sep : num 0 0 0 0 0 0 0 0 0 0 ...
## $ duration : num 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : num 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome.other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome.success : num 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome.unknown : num 1 1 1 1 1 1 1 1 1 1 ...
## $ y : num 0 0 0 0 0 0 0 0 0 0 ...
#==================================================================
#Converting the dependent variable back to categorical
#==================================================================
transformed$y<-as.factor(transformed$y)
#save(transformed,file="transformed.RData")
load("transformed.RData")
#==================================================================
#Spliting training set into two parts based on outcome: 70% and 30%
#==================================================================
index<-createDataPartition(transformed$y,p=0.70, list=FALSE)
trainSet<-transformed[index,]
testSet<-transformed[-index,]
outcomeName<-'y'
predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
predictors
## [1] "age" "jobblue.collar" "jobentrepreneur"
## [4] "jobhousemaid" "jobmanagement" "jobretired"
## [7] "jobself.employed" "jobservices" "jobstudent"
## [10] "jobtechnician" "jobunemployed" "jobunknown"
## [13] "maritalmarried" "maritalsingle" "educationsecondary"
## [16] "educationtertiary" "educationunknown" "defaultyes"
## [19] "balance" "housingyes" "loanyes"
## [22] "contacttelephone" "contactunknown" "day"
## [25] "monthaug" "monthdec" "monthfeb"
## [28] "monthjan" "monthjul" "monthjun"
## [31] "monthmar" "monthmay" "monthnov"
## [34] "monthoct" "monthsep" "duration"
## [37] "campaign" "pdays" "previous"
## [40] "poutcomeother" "poutcomesuccess" "poutcomeunknown"
dim(trainSet)
## [1] 31648 43
dim(testSet)
## [1] 13563 43
The training data has 31648 rows and 43 columns, whereas the test data has 13563 rows and 43 columns. With 43 features, feature selection is necessary to reduce overfitting and the complexity of the models we are going to build.
Recursive Feature Selection
#==================================================================
#Feature selection using rfe in caret (recursive feature elimination)
#predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
#Alternatively
#predictors<-setdiff(names(trainSet),outcomeName)
#==================================================================
library(randomForest)
# control <- rfeControl(functions = rfFuncs,
# method = "repeatedcv",
# repeats = 3,
# verbose = FALSE,
# allowParallel = TRUE)
# outcomeName<-'y'
#
# predictors<-names(trainSet)[!names(trainSet) %in% outcomeName]
#
# feature_select <- rfe(trainSet[,predictors], trainSet[,outcomeName],
# rfeControl = control)
#save(feature_select,file="feature_select.RData")
load("feature_select.RData")
table(trainSet$y)
##
## 0 1
## 27935 3713
feature_select
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 3 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 4 0.9012 0.3913 0.003364 0.02533
## 8 0.9000 0.3538 0.003068 0.02860
## 16 0.9048 0.4450 0.003179 0.01995 *
## 42 0.9047 0.4318 0.004349 0.02868
##
## The top 5 variables (out of 16):
## duration, poutcomesuccess, monthmar, contactunknown, housingyes
names(feature_select)
## [1] "pred" "variables" "results" "bestSubset"
## [5] "fit" "optVariables" "optsize" "call"
## [9] "control" "resample" "metric" "maximize"
## [13] "perfNames" "times" "resampledCM" "obsLevels"
## [17] "dots"
predictors(feature_select)
## [1] "duration" "poutcomesuccess" "monthmar"
## [4] "contactunknown" "housingyes" "age"
## [7] "monthoct" "day" "monthsep"
## [10] "pdays" "monthjul" "monthaug"
## [13] "monthmay" "monthjun" "monthdec"
## [16] "campaign"
summary(feature_select)
## Length Class Mode
## pred 0 -none- NULL
## variables 6 data.frame list
## results 5 data.frame list
## bestSubset 1 -none- numeric
## fit 18 randomForest list
## optVariables 16 -none- character
## optsize 1 -none- numeric
## call 4 -none- call
## control 14 -none- list
## resample 8 data.frame list
## metric 1 -none- character
## maximize 1 -none- logical
## perfNames 2 -none- character
## times 3 -none- list
## resampledCM 0 -none- NULL
## obsLevels 2 -none- character
## dots 0 -none- list
feature_select[ "bestSubset"]
## $bestSubset
## [1] 16
#The top 5 variables (out of 42):
#print("The top 5 variables (out of 42)\n")
cat("The top 5 variables (out of 42)\n")
## The top 5 variables (out of 42)
cat("duration, poutcomesuccess, monthmar, contactunknown, housingyes\n")
## duration, poutcomesuccess, monthmar, contactunknown, housingyes
predictors(feature_select)
## [1] "duration" "poutcomesuccess" "monthmar"
## [4] "contactunknown" "housingyes" "age"
## [7] "monthoct" "day" "monthsep"
## [10] "pdays" "monthjul" "monthaug"
## [13] "monthmay" "monthjun" "monthdec"
## [16] "campaign"
#===================================================================================
# plot variable selection
#===================================================================================
trellis.par.set(caretTheme())
plot(feature_select, type = c( "o","g"))
The top 5 variables alone provide an accuracy of about 90% on this data; the remaining variables together add less than half a percentage point. This is the advantage of feature selection: it reduces model complexity, reduces overfitting, and also saves computation time.
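As a minimal sketch (assuming feature_select and trainSet are as loaded above, and that the selected names match names(trainSet)), the reduced training set can be built directly from the selected predictors:
# keep only the rfe-selected predictors plus the outcome column
selected <- predictors(feature_select)
trainSet_reduced <- trainSet[, c(selected, outcomeName)]
dim(trainSet_reduced)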
#===================================================================================
#Taking only the top 5 predictors from feature_select:
#duration, poutcomesuccess, monthmar, contactunknown, housingyes
# Cs function from Hmisc converts unquoted names to a character vector
#===================================================================================
library(Hmisc)
p=c(paste0(predictors(feature_select),sep=",",collapse = ""))
#trainSet[,]
predictor=Hmisc::Cs(duration,poutcomesuccess,monthmar,contactunknown,age,housingyes,day,monthoct,pdays,monthsep,monthjul,monthdec,monthmay,monthaug,campaign,monthjun)
print("The set of features selected is:\n")
## [1] "The set of features selected is:\n"
p
## [1] "duration,poutcomesuccess,monthmar,contactunknown,housingyes,age,monthoct,day,monthsep,pdays,monthjul,monthaug,monthmay,monthjun,monthdec,campaign,"
#trainSet[,outcomeName]%>%head()
#trainSet[,outcomeName]
Topfivepred = Hmisc::Cs(duration, poutcomesuccess, monthmar, contactunknown, housingyes)
names(trainSet)
## [1] "age" "jobblue.collar" "jobentrepreneur"
## [4] "jobhousemaid" "jobmanagement" "jobretired"
## [7] "jobself.employed" "jobservices" "jobstudent"
## [10] "jobtechnician" "jobunemployed" "jobunknown"
## [13] "maritalmarried" "maritalsingle" "educationsecondary"
## [16] "educationtertiary" "educationunknown" "defaultyes"
## [19] "balance" "housingyes" "loanyes"
## [22] "contacttelephone" "contactunknown" "day"
## [25] "monthaug" "monthdec" "monthfeb"
## [28] "monthjan" "monthjul" "monthjun"
## [31] "monthmar" "monthmay" "monthnov"
## [34] "monthoct" "monthsep" "duration"
## [37] "campaign" "pdays" "previous"
## [40] "poutcomeother" "poutcomesuccess" "poutcomeunknown"
## [43] "y"
Logistic Regression Model
# trainSet$y<-as.factor(trainSet$y)
#
# # Set up training control
# ctrl <- trainControl(method = "repeatedcv", # repeated cross validation
#                      number = 5, # 5-fold CV
# summaryFunction=twoClassSummary,
# classProbs=TRUE,
# allowParallel = TRUE
# ) # Use AUC to pick the best model instead of default ,"Accuracy"
#
#
#
#
#
# #This fixes problem with names when classProbs=TRUE
# trainSetglm<-if_else(trainSet[,outcomeName]==0,"No","Yes")
#
#
# model_glm<-train(trainSet[,Topfivepred],trainSetglm,method='glm',family="binomial",trControl=ctrl,
# metric=c("ROC"))
#
# model_glm2<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method='glm',family="binomial")
#
The accuracy of the logistic regression model on the test set is about 90.2%.
#save(model_glm,file="model_glm.RData")
load("model_glm.RData")
#save(model_glm2,file="model_glm2.RData")
load("model_glm2.RData")
summary(model_glm2)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.6320 -0.4069 -0.2862 -0.1763 3.1529
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.989e+00 3.847e-02 -77.69 <2e-16 ***
## duration 3.994e-03 7.338e-05 54.42 <2e-16 ***
## poutcomesuccess 2.617e+00 7.374e-02 35.49 <2e-16 ***
## monthmar 2.175e+00 1.258e-01 17.29 <2e-16 ***
## contactunknown -1.301e+00 6.729e-02 -19.34 <2e-16 ***
## housingyes -7.931e-01 4.373e-02 -18.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22845 on 31648 degrees of freedom
## Residual deviance: 16233 on 31643 degrees of freedom
## AIC: 16245
##
## Number of Fisher Scoring iterations: 6
# Predict using the test data
pred<-predict(model_glm,testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('logistic model')
# Print, plot variable importance
print(varImp(model_glm, scale = FALSE))
## glm variable importance
##
## Overall
## duration 54.42
## poutcomesuccess 35.49
## contactunknown 19.34
## housingyes 18.14
## monthmar 17.29
plot(varImp(model_glm, scale = FALSE), main="Variable Importance using logistic/glm")
#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")
confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 11716 271
## Yes 1061 515
##
## Accuracy : 0.9018
## 95% CI : (0.8967, 0.9068)
## No Information Rate : 0.942
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3888
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9170
## Specificity : 0.6552
## Pos Pred Value : 0.9774
## Neg Pred Value : 0.3268
## Prevalence : 0.9420
## Detection Rate : 0.8638
## Detection Prevalence : 0.8838
## Balanced Accuracy : 0.7861
##
## 'Positive' Class : No
##
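The models below are tuned on ROC, so it is also useful to look at the logistic model's test-set AUC. A minimal sketch with pROC, assuming model_glm, testSet, Topfivepred and testSetglm are as defined above:
# predicted probability of the positive class ("Yes") on the test set
prob_yes <- predict(model_glm, testSet[, Topfivepred], type = "prob")[, "Yes"]
# ROC curve and AUC for the logistic model
roc_glm <- pROC::roc(response = testSetglm, predictor = prob_yes, levels = c("No", "Yes"))
pROC::auc(roc_glm)
plot(roc_glm, main = "ROC - logistic regression")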
We fit multiple algorithms at this step and compare their performance. The four models trained in the loop below are boosted logistic regression (LogitBoost), extreme gradient boosting (xgbTree), random forest (rf) and generalized partial least squares (gpls); a support vector machine with a radial kernel and a neural network are fit separately further down.
#===================================================================================
# multiple algorithms
#===================================================================================
# Set up training control
fitControl <- trainControl(method = "repeatedcv", # 10 fold cross validation
number = 5, # do 5 repititions of cv
summaryFunction=twoClassSummary, # Use AUC to pick the best model
classProbs=TRUE,
allowParallel = TRUE)
#This fixes problem with names when classProbs=TRUE
trainSetglm<-if_else(trainSet[,outcomeName]==0,"No","Yes")
#
# met=c("LogitBoost", 'xgbTree', 'rf', 'gpls')
#
#
#
# re=list()
#
# for (i in seq_along(met)) {
#
# re[[i]]<-train(trainSet[,Topfivepred],trainSetglm,method=met[i],
#
# preProcess = c("center","scale"),trControl=fitControl,metric = "ROC")
# }
Save the fitted models so they do not have to be retrained every time the document is run.
#save(re,file="multiplealgorithms.RData")
load("multiplealgorithms.RData")
#re[[1]]
#re[[2]]
#re[[3]]
#re[[4]]
#re%>%as.list.data.frame()%>%kable()
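Before looking at each model individually, a quick side-by-side comparison can be made with caret's resamples(); a minimal sketch, assuming the list re loaded above holds the fitted LogitBoost, xgbTree, rf and gpls models (strictly, the fits should share resampling indices, e.g. via the index argument of trainControl):
# collect the resampling results of the four models and summarise ROC/Sens/Spec
model_list <- list(LogitBoost = re[[1]], xgbTree = re[[2]], rf = re[[3]], gpls = re[[4]])
resamp <- resamples(model_list)
summary(resamp)
bwplot(resamp, metric = "ROC")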
Boosted Logistic Regression
re[[1]]%>%head()
## $method
## [1] "LogitBoost"
##
## $modelInfo
## $modelInfo$label
## [1] "Boosted Logistic Regression"
##
## $modelInfo$library
## [1] "caTools"
##
## $modelInfo$loop
## function (grid)
## {
## loop <- grid[which.max(grid$nIter), , drop = FALSE]
## submodels <- grid[-which.max(grid$nIter), , drop = FALSE]
## submodels <- list(submodels)
## list(loop = loop, submodels = submodels)
## }
##
## $modelInfo$type
## [1] "Classification"
##
## $modelInfo$parameters
## parameter class label
## 1 nIter numeric # Boosting Iterations
##
## $modelInfo$grid
## function (x, y, len = NULL, search = "grid")
## {
## if (search == "grid") {
## out <- data.frame(nIter = 1 + ((1:len) * 10))
## }
## else {
## out <- data.frame(nIter = unique(sample(1:100, size = len,
## replace = TRUE)))
## }
## out
## }
##
## $modelInfo$fit
## function (x, y, wts, param, lev, last, classProbs, ...)
## {
## caTools::LogitBoost(as.matrix(x), y, nIter = param$nIter)
## }
##
## $modelInfo$predict
## function (modelFit, newdata, submodels = NULL)
## {
## out <- caTools::predict.LogitBoost(modelFit, newdata, type = "class")
## if (!is.null(submodels)) {
## tmp <- out
## out <- vector(mode = "list", length = nrow(submodels) +
## 1)
## out[[1]] <- tmp
## for (j in seq(along = submodels$nIter)) {
## out[[j + 1]] <- caTools::predict.LogitBoost(modelFit,
## newdata, nIter = submodels$nIter[j])
## }
## }
## out
## }
##
## $modelInfo$prob
## function (modelFit, newdata, submodels = NULL)
## {
## out <- caTools::predict.LogitBoost(modelFit, newdata, type = "raw")
## out <- t(apply(out, 1, function(x) x/sum(x)))
## if (!is.null(submodels)) {
## tmp <- vector(mode = "list", length = nrow(submodels) +
## 1)
## tmp[[1]] <- out
## for (j in seq(along = submodels$nIter)) {
## tmpProb <- caTools::predict.LogitBoost(modelFit,
## newdata, type = "raw", nIter = submodels$nIter[j])
## tmpProb <- out <- t(apply(tmpProb, 1, function(x) x/sum(x)))
## tmp[[j + 1]] <- as.data.frame(tmpProb[, modelFit$obsLevels,
## drop = FALSE])
## }
## out <- tmp
## }
## out
## }
##
## $modelInfo$predictors
## function (x, ...)
## {
## if (!is.null(x$xNames)) {
## out <- unique(x$xNames[x$Stump[, "feature"]])
## }
## else out <- NA
## out
## }
##
## $modelInfo$levels
## function (x)
## x$obsLevels
##
## $modelInfo$tags
## [1] "Ensemble Model" "Boosting"
## [3] "Implicit Feature Selection" "Tree-Based Model"
## [5] "Logistic Regression"
##
## $modelInfo$sort
## function (x)
## x[order(x[, 1]), ]
##
##
## $modelType
## [1] "Classification"
##
## $results
## nIter ROC Sens Spec ROCSD SensSD SpecSD
## 1 11 0.8321840 0.9721247 0.3243221 0.005703393 0.009006722 0.06403445
## 2 21 0.8437131 0.9721609 0.3153970 0.018153601 0.009202690 0.05757145
## 3 31 0.8439240 0.9619261 0.3665091 0.010009442 0.019781419 0.09638093
##
## $pred
## NULL
##
## $bestTune
## nIter
## 3 31
re[[1]][[4]]%>%kable()
nIter | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
---|---|---|---|---|---|---|
11 | 0.8321840 | 0.9721247 | 0.3243221 | 0.0057034 | 0.0090067 | 0.0640345 |
21 | 0.8437131 | 0.9721609 | 0.3153970 | 0.0181536 | 0.0092027 | 0.0575714 |
31 | 0.8439240 | 0.9619261 | 0.3665091 | 0.0100094 | 0.0197814 | 0.0963809 |
plot(re[[1]])
varImp(object=re[[1]])
## ROC curve variable importance
##
## Importance
## duration 100.00
## housingyes 31.53
## contactunknown 30.61
## poutcomesuccess 22.39
## monthmar 0.00
# Predict using the test data
pred<-predict(re[[1]],testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Boosted Logistic Regression')
# Print, plot variable importance
print(varImp(re[[1]], scale = FALSE))
## ROC curve variable importance
##
## Importance
## duration 0.8045
## housingyes 0.6103
## contactunknown 0.6077
## poutcomesuccess 0.5844
## monthmar 0.5209
plot(varImp(re[[1]], scale = FALSE), main="Boosted Logistic Regression")
#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")
confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 11407 580
## Yes 1573 3
##
## Accuracy : 0.8413
## 95% CI : (0.835, 0.8474)
## No Information Rate : 0.957
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.064
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.878814
## Specificity : 0.005146
## Pos Pred Value : 0.951614
## Neg Pred Value : 0.001904
## Prevalence : 0.957015
## Detection Rate : 0.841038
## Detection Prevalence : 0.883802
## Balanced Accuracy : 0.441980
##
## 'Positive' Class : No
##
#other model plot
trellis.par.set(caretTheme())
#plot(re[[1]],plotType = "level",scales=list(x=list(rot=90)))
Extreme Gradient Boosting
#re[[2]]
re[[2]][[4]]%>%head()%>%kable()
 | eta | max_depth | gamma | colsample_bytree | min_child_weight | subsample | nrounds | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.3 | 1 | 0 | 0.6 | 1 | 0.50 | 50 | 0.8543912 | 0.9743433 | 0.3424142 | 0.0113690 | 0.0028483 | 0.0352174 |
4 | 0.3 | 1 | 0 | 0.6 | 1 | 0.75 | 50 | 0.8544852 | 0.9755958 | 0.3329675 | 0.0106884 | 0.0013653 | 0.0137451 |
7 | 0.3 | 1 | 0 | 0.6 | 1 | 1.00 | 50 | 0.8543956 | 0.9767766 | 0.3235208 | 0.0110311 | 0.0012870 | 0.0177499 |
10 | 0.3 | 1 | 0 | 0.8 | 1 | 0.50 | 50 | 0.8764375 | 0.9745581 | 0.3307991 | 0.0103923 | 0.0044108 | 0.0282519 |
13 | 0.3 | 1 | 0 | 0.8 | 1 | 0.75 | 50 | 0.8762603 | 0.9783152 | 0.3181158 | 0.0105884 | 0.0025703 | 0.0217097 |
16 | 0.3 | 1 | 0 | 0.8 | 1 | 1.00 | 50 | 0.8763053 | 0.9777784 | 0.3202794 | 0.0105829 | 0.0028286 | 0.0246302 |
plot(re[[2]])
varImp(object=re[[2]])
## xgbTree variable importance
##
## Overall
## duration 100.000
## poutcomesuccess 33.837
## contactunknown 21.513
## monthmar 7.581
## housingyes 0.000
# Predict using the test data
pred<-predict(re[[2]],testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Extreme Gradient Boosting')
# Print, plot variable importance
print(varImp(re[[2]], scale = FALSE))
## xgbTree variable importance
##
## Overall
## duration 0.61376
## poutcomesuccess 0.20768
## contactunknown 0.13204
## monthmar 0.04653
## housingyes 0.00000
plot(varImp(re[[2]], scale = FALSE), main="Extreme Gradient Boosting")
#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")
confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 11971 16
## Yes 1565 11
##
## Accuracy : 0.8834
## 95% CI : (0.8779, 0.8888)
## No Information Rate : 0.998
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0098
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.88438
## Specificity : 0.40741
## Pos Pred Value : 0.99867
## Neg Pred Value : 0.00698
## Prevalence : 0.99801
## Detection Rate : 0.88262
## Detection Prevalence : 0.88380
## Balanced Accuracy : 0.64589
##
## 'Positive' Class : No
##
#other model plot
trellis.par.set(caretTheme())
#plot(re[[2]],plotType = "level",scales=list(x=list(rot=90)))
Random Forest
re[[3]]
## Random Forest
##
## 31649 samples
## 5 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 25319, 25319, 25319, 25319, 25320
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.7898092 0.9777786 0.3197410
## 3 0.8065867 0.9742718 0.3483627
## 5 0.8175060 0.9582409 0.3291888
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
re[[3]][[4]]%>%kable()
mtry | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
---|---|---|---|---|---|---|
2 | 0.7898092 | 0.9777786 | 0.3197410 | 0.0059779 | 0.0017496 | 0.0067902 |
3 | 0.8065867 | 0.9742718 | 0.3483627 | 0.0021690 | 0.0031719 | 0.0194566 |
5 | 0.8175060 | 0.9582409 | 0.3291888 | 0.0083053 | 0.0016630 | 0.0133612 |
plot(re[[3]])
varImp(object=re[[3]])
## rf variable importance
##
## Overall
## duration 100.0000
## poutcomesuccess 17.3343
## housingyes 3.5630
## contactunknown 0.3452
## monthmar 0.0000
# Predict using the test data
pred<-predict(re[[3]],testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Random Forest')
# Print, plot variable importance
print(varImp(re[[3]], scale = FALSE))
## rf variable importance
##
## Overall
## duration 2803.2
## poutcomesuccess 577.7
## housingyes 206.9
## contactunknown 120.3
## monthmar 111.0
plot(varImp(re[[3]], scale = FALSE), main="Random Forest")
#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")
confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 11722 265
## Yes 843 733
##
## Accuracy : 0.9183
## 95% CI : (0.9136, 0.9229)
## No Information Rate : 0.9264
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.5269
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9329
## Specificity : 0.7345
## Pos Pred Value : 0.9779
## Neg Pred Value : 0.4651
## Prevalence : 0.9264
## Detection Rate : 0.8643
## Detection Prevalence : 0.8838
## Balanced Accuracy : 0.8337
##
## 'Positive' Class : No
##
#other model plot
trellis.par.set(caretTheme())
#plot(re[[3]],plotType = "level",scales=list(x=list(rot=90)))
Support Vector Machines with Radial Basis Function Kernel
# Set up training control
fitControl <- trainControl(method = "repeatedcv", # 10 fold cross validation
number = 5, # do 5 repititions of cv
summaryFunction=twoClassSummary, # Use AUC to pick the best model
classProbs=TRUE,
allowParallel = TRUE)
#This fixes problem with names when classProbs=TRUE
trainSetglm<-if_else(trainSet[,outcomeName]==0,"No","Yes")
#model_svm<-train(trainSet[,Topfivepred],trainSetglm,method='svmRadial',
# trControl=fitControl,preProcess = c("center","scale"))
#
#predictions<-predict.train(object=re[[4]],testSet[,predictors],type="raw")
#predictions
#save(model_svm,file="model_svm.RData")
load("model_svm.RData")
model_svm
## Support Vector Machines with Radial Basis Function Kernel
##
## 31649 samples
## 5 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 25320, 25320, 25319, 25318, 25319
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 0.25 0.6232691 0.9791866 0.3190709
## 0.50 0.7684297 0.9778859 0.3183791
## 1.00 0.7668203 0.9779217 0.3154112
##
## Tuning parameter 'sigma' was held constant at a value of 3.715887
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 3.715887 and C = 0.5.
ggplot(model_svm)
#other model plot
trellis.par.set(caretTheme())
#plot(model_svm,plotType = "level",scales=list(x=list(rot=90)))
varImp(object=model_svm)
## ROC curve variable importance
##
## Importance
## duration 100.00
## housingyes 31.53
## contactunknown 30.61
## poutcomesuccess 22.39
## monthmar 0.00
# Predict using the test data
#pred<-predict(model_svm,testSet)
#predictions<-predict(object=model_svm,testSet[,Topfivepred],type="raw")
#predictions<-caret::predict.train(object=model_svm,testSet[,Topfivepred],type="raw")
#my_data=data.frame(cbind(predicted=predictions,observed=testSet$y))
#ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Support Vector Machines with Radial Basis Function Kernel')
# Print, plot variable importance
print(varImp(model_svm, scale = FALSE))
## ROC curve variable importance
##
## Importance
## duration 0.8045
## housingyes 0.6103
## contactunknown 0.6077
## poutcomesuccess 0.5844
## monthmar 0.5209
plot(varImp(model_svm, scale = FALSE), main="Support Vector Machines with Radial Basis Function Kernel")
# Print, plot variable importance
print(varImp(model_svm, scale = FALSE))
## ROC curve variable importance
##
## Importance
## duration 0.8045
## housingyes 0.6103
## contactunknown 0.6077
## poutcomesuccess 0.5844
## monthmar 0.5209
#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")
#confusionMatrix(testSetglm,predictions)
#predictions%>%head()
#other model plot
trellis.par.set(caretTheme())
#plot(re[[3]],plotType = "level",scales=list(x=list(rot=90)))
Find the models that are supported by the caret package; at last count, over 200 models can be fit through caret.
names(getModelInfo())%>%head()
## [1] "AdaBag" "AdaBoost.M1" "adaboost" "ada" "amdai"
## [6] "ANFIS"
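For example, to search the model codes for a keyword or inspect the tuning parameters of a particular method (illustrative calls, not required for the analysis below):
# model codes that mention "boost", and the tuning parameters of xgbTree
grep("boost", names(getModelInfo()), value = TRUE, ignore.case = TRUE)
modelLookup("xgbTree")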
#===================================================================================
#gradient boosted trees
# parameter tuning
#===================================================================================
fitControl <- trainControl(method = "repeatedcv",number = 5,repeats = 5,
allowParallel = TRUE )
#repeat=5, 5 separate sample 5 fold cross validation because number =10
#fitControl <- trainControl(
# method = "repeatedcv",
# number = 5,classProbs = TRUE,
# repeats = 5,allowParallel = TRUE)
#Creating grid
#grid <- expand.grid(n.trees=c(10,20,50,100,500,1000), # Number of trees to fit,
# shrinkage=c(0.01,0.05,0.1,0.5),
# n.minobsinnode = c(3,5,10),
# interaction.depth=c(1,5,10) # Depth of variable interactions)
#
#
grid <- expand.grid(interaction.depth=c(1,5,10), # Depth of variable interactions
n.trees=c(10,20,50,100), # Number trees to fit
shrinkage=c(0.01,0.05,0.1,0.5), # Try 2 values for learning rate
n.minobsinnode = c(3,5,10))
# training the model
#model_gbm<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method='gbm',trControl=fitControl,
# tuneGrid=grid,preProcess = c("center","scale"))
# summarizing the model
modelLookup(model='gbm')
#model_gbm%>%as.list.data.frame()%>%kable()
#save(model_gbm,file="model_gbm.RData")
load("model_gbm.RData")
print(model_gbm)%>%head()
## Stochastic Gradient Boosting
##
## 31649 samples
## 16 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 25320, 25320, 25318, 25319, 25319, 25319, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees Accuracy
## 0.01 1 3 10 0.8829979
## 0.01 1 3 20 0.8829979
## 0.01 1 3 50 0.8829979
## 0.01 1 3 100 0.8848242
## 0.01 1 3 500 0.8934753
## 0.01 1 3 1000 0.8984422
## 0.01 1 5 10 0.8829979
## 0.01 1 5 20 0.8829979
## 0.01 1 5 50 0.8829979
## 0.01 1 5 100 0.8847926
## 0.01 1 5 500 0.8931783
## 0.01 1 5 1000 0.8982463
## 0.01 1 10 10 0.8829979
## 0.01 1 10 20 0.8829979
## 0.01 1 10 50 0.8829979
## 0.01 1 10 100 0.8847736
## 0.01 1 10 500 0.8931846
## 0.01 1 10 1000 0.8982969
## 0.01 5 3 10 0.8829979
## 0.01 5 3 20 0.8829979
## 0.01 5 3 50 0.8829979
## 0.01 5 3 100 0.8839395
## 0.01 5 3 500 0.9031945
## 0.01 5 3 1000 0.9042624
## 0.01 5 5 10 0.8829979
## 0.01 5 5 20 0.8829979
## 0.01 5 5 50 0.8829979
## 0.01 5 5 100 0.8839269
## 0.01 5 5 500 0.9032956
## 0.01 5 5 1000 0.9042372
## 0.01 5 10 10 0.8829979
## 0.01 5 10 20 0.8829979
## 0.01 5 10 50 0.8829979
## 0.01 5 10 100 0.8839647
## 0.01 5 10 500 0.9032324
## 0.01 5 10 1000 0.9045215
## 0.01 10 3 10 0.8829979
## 0.01 10 3 20 0.8829979
## 0.01 10 3 50 0.8829979
## 0.01 10 3 100 0.8894309
## 0.01 10 3 500 0.9051219
## 0.01 10 3 1000 0.9055390
## 0.01 10 5 10 0.8829979
## 0.01 10 5 20 0.8829979
## 0.01 10 5 50 0.8829979
## 0.01 10 5 100 0.8891530
## 0.01 10 5 500 0.9051914
## 0.01 10 5 1000 0.9056780
## 0.01 10 10 10 0.8829979
## 0.01 10 10 20 0.8829979
## 0.01 10 10 50 0.8829979
## 0.01 10 10 100 0.8893614
## 0.01 10 10 500 0.9052546
## 0.01 10 10 1000 0.9059245
## 0.05 1 3 10 0.8829979
## 0.05 1 3 20 0.8849885
## 0.05 1 3 50 0.8880660
## 0.05 1 3 100 0.8936269
## 0.05 1 3 500 0.9009384
## 0.05 1 3 1000 0.9020001
## 0.05 1 5 10 0.8829979
## 0.05 1 5 20 0.8849695
## 0.05 1 5 50 0.8879965
## 0.05 1 5 100 0.8939429
## 0.05 1 5 500 0.9007804
## 0.05 1 5 1000 0.9021897
## 0.05 1 10 10 0.8829979
## 0.05 1 10 20 0.8850517
## 0.05 1 10 50 0.8879712
## 0.05 1 10 100 0.8934943
## 0.05 1 10 500 0.9007804
## 0.05 1 10 1000 0.9019495
## 0.05 5 3 10 0.8829979
## 0.05 5 3 20 0.8854181
## 0.05 5 3 50 0.9005530
## 0.05 5 3 100 0.9032576
## 0.05 5 3 500 0.9049134
## 0.05 5 3 1000 0.9048060
## 0.05 5 5 10 0.8829979
## 0.05 5 5 20 0.8852918
## 0.05 5 5 50 0.9009195
## 0.05 5 5 100 0.9030554
## 0.05 5 5 500 0.9054442
## 0.05 5 5 1000 0.9055200
## 0.05 5 10 10 0.8829979
## 0.05 5 10 20 0.8848116
## 0.05 5 10 50 0.9007299
## 0.05 5 10 100 0.9028722
## 0.05 5 10 500 0.9049892
## 0.05 5 10 1000 0.9048439
## 0.05 10 3 10 0.8829979
## 0.05 10 3 20 0.8904736
## 0.05 10 3 50 0.9027395
## 0.05 10 3 100 0.9045216
## 0.05 10 3 500 0.9056465
## 0.05 10 3 1000 0.9045279
## 0.05 10 5 10 0.8829979
## 0.05 10 5 20 0.8916996
## 0.05 10 5 50 0.9022655
## 0.05 10 5 100 0.9048249
## 0.05 10 5 500 0.9055832
## 0.05 10 5 1000 0.9042309
## 0.05 10 10 10 0.8829979
## 0.05 10 10 20 0.8912509
## 0.05 10 10 50 0.9025941
## 0.05 10 10 100 0.9046669
## 0.05 10 10 500 0.9056464
## 0.05 10 10 1000 0.9047112
## 0.10 1 3 10 0.8854056
## 0.10 1 3 20 0.8874214
## 0.10 1 3 50 0.8941199
## 0.10 1 3 100 0.8984107
## 0.10 1 3 500 0.9020633
## 0.10 1 3 1000 0.9020443
## 0.10 1 5 10 0.8850770
## 0.10 1 5 20 0.8872824
## 0.10 1 5 50 0.8941009
## 0.10 1 5 100 0.8987709
## 0.10 1 5 500 0.9020001
## 0.10 1 5 1000 0.9024614
## 0.10 1 10 10 0.8853550
## 0.10 1 10 20 0.8873330
## 0.10 1 10 50 0.8940566
## 0.10 1 10 100 0.8986066
## 0.10 1 10 500 0.9018674
## 0.10 1 10 1000 0.9021139
## 0.10 5 3 10 0.8875099
## 0.10 5 3 20 0.8986066
## 0.10 5 3 50 0.9026320
## 0.10 5 3 100 0.9039591
## 0.10 5 3 500 0.9047364
## 0.10 5 3 1000 0.9040160
## 0.10 5 5 10 0.8867073
## 0.10 5 5 20 0.8986951
## 0.10 5 5 50 0.9028469
## 0.10 5 5 100 0.9042245
## 0.10 5 5 500 0.9049323
## 0.10 5 5 1000 0.9037190
## 0.10 5 10 10 0.8875098
## 0.10 5 10 20 0.8979999
## 0.10 5 10 50 0.9031123
## 0.10 5 10 100 0.9040728
## 0.10 5 10 500 0.9048754
## 0.10 5 10 1000 0.9043572
## 0.10 10 3 10 0.8925590
## 0.10 10 3 20 0.9012102
## 0.10 10 3 50 0.9048123
## 0.10 10 3 100 0.9052546
## 0.10 10 3 500 0.9035611
## 0.10 10 3 1000 0.9017663
## 0.10 10 5 10 0.8929381
## 0.10 10 5 20 0.9013681
## 0.10 10 5 50 0.9047743
## 0.10 10 5 100 0.9053431
## 0.10 10 5 500 0.9038201
## 0.10 10 5 1000 0.9018927
## 0.10 10 10 10 0.8928560
## 0.10 10 10 20 0.9011028
## 0.10 10 10 50 0.9039907
## 0.10 10 10 100 0.9050966
## 0.10 10 10 500 0.9037506
## 0.10 10 10 1000 0.9020886
## 0.50 1 3 10 0.8967866
## 0.50 1 3 20 0.8991690
## 0.50 1 3 50 0.9008057
## 0.50 1 3 100 0.9015893
## 0.50 1 3 500 0.9020317
## 0.50 1 3 1000 0.9017536
## 0.50 1 5 10 0.8962305
## 0.50 1 5 20 0.8993459
## 0.50 1 5 50 0.9005782
## 0.50 1 5 100 0.9019116
## 0.50 1 5 500 0.9020569
## 0.50 1 5 1000 0.9018042
## 0.50 1 10 10 0.8968182
## 0.50 1 10 20 0.8996050
## 0.50 1 10 50 0.9008563
## 0.50 1 10 100 0.9016272
## 0.50 1 10 500 0.9016778
## 0.50 1 10 1000 0.9019243
## 0.50 5 3 10 0.9005720
## 0.50 5 3 20 0.9016715
## 0.50 5 3 50 0.9016399
## 0.50 5 3 100 0.9011028
## 0.50 5 3 500 0.8955165
## 0.50 5 3 1000 0.8935006
## 0.50 5 5 10 0.9007678
## 0.50 5 5 20 0.9020317
## 0.50 5 5 50 0.9013050
## 0.50 5 5 100 0.9007615
## 0.50 5 5 500 0.8957882
## 0.50 5 5 1000 0.8922620
## 0.50 5 10 10 0.9009890
## 0.50 5 10 20 0.9015704
## 0.50 5 10 50 0.9022719
## 0.50 5 10 100 0.9018611
## 0.50 5 10 500 0.8978357
## 0.50 5 10 1000 0.8947202
## 0.50 10 3 10 0.9019496
## 0.50 10 3 20 0.9017032
## 0.50 10 3 50 0.8990744
## 0.50 10 3 100 0.8976525
## 0.50 10 3 500 0.8911499
## 0.50 10 3 1000 0.8903411
## 0.50 10 5 10 0.9016716
## 0.50 10 5 20 0.9011596
## 0.50 10 5 50 0.9004139
## 0.50 10 5 100 0.8984234
## 0.50 10 5 500 0.8906506
## 0.50 10 5 1000 0.8879459
## 0.50 10 10 10 0.9008184
## 0.50 10 10 20 0.9005025
## 0.50 10 10 50 0.9001107
## 0.50 10 10 100 0.8987583
## 0.50 10 10 500 0.8928433
## 0.50 10 10 1000 0.8903283
## Kappa
## 0.00000000
## 0.00000000
## 0.00000000
## 0.03330150
## 0.20288362
## 0.32936482
## 0.00000000
## 0.00000000
## 0.00000000
## 0.03315239
## 0.19728622
## 0.32813486
## 0.00000000
## 0.00000000
## 0.00000000
## 0.03271344
## 0.19769242
## 0.32838124
## 0.00000000
## 0.00000000
## 0.00000000
## 0.01979491
## 0.42653711
## 0.45011170
## 0.00000000
## 0.00000000
## 0.00000000
## 0.02003560
## 0.42854145
## 0.44991999
## 0.00000000
## 0.00000000
## 0.00000000
## 0.02081668
## 0.42706174
## 0.45227308
## 0.00000000
## 0.00000000
## 0.00000000
## 0.12483824
## 0.45763732
## 0.47169234
## 0.00000000
## 0.00000000
## 0.00000000
## 0.11868765
## 0.45869706
## 0.47305723
## 0.00000000
## 0.00000000
## 0.00000000
## 0.12266339
## 0.45999223
## 0.47374887
## 0.00000000
## 0.03655032
## 0.09264904
## 0.21205073
## 0.40294661
## 0.41960354
## 0.00000000
## 0.03578969
## 0.09164590
## 0.22013113
## 0.40225742
## 0.42208314
## 0.00000000
## 0.03729892
## 0.09105956
## 0.20667468
## 0.40182904
## 0.42004092
## 0.00000000
## 0.04801892
## 0.37246906
## 0.42804343
## 0.46474308
## 0.46687662
## 0.00000000
## 0.04640580
## 0.37491578
## 0.42611118
## 0.46715974
## 0.47151185
## 0.00000000
## 0.03792738
## 0.37450261
## 0.42527966
## 0.46445271
## 0.46730406
## 0.00000000
## 0.14674980
## 0.41009792
## 0.45481343
## 0.47790108
## 0.47523556
## 0.00000000
## 0.16658088
## 0.40525578
## 0.45726659
## 0.47661506
## 0.47428494
## 0.00000000
## 0.15373679
## 0.40903277
## 0.45658443
## 0.47716293
## 0.47609486
## 0.04317232
## 0.07941222
## 0.22375317
## 0.33729930
## 0.42025078
## 0.42356660
## 0.03836378
## 0.07759281
## 0.23054029
## 0.33699293
## 0.42013753
## 0.42609111
## 0.04215116
## 0.07858481
## 0.22110731
## 0.33530467
## 0.41907190
## 0.42396224
## 0.08890398
## 0.32503865
## 0.42544827
## 0.44885653
## 0.46799809
## 0.46960602
## 0.07482536
## 0.32493037
## 0.42734017
## 0.45188310
## 0.46901781
## 0.46754899
## 0.08443919
## 0.31742895
## 0.42815160
## 0.44947247
## 0.46691791
## 0.46932781
## 0.18377223
## 0.38021718
## 0.45552578
## 0.47022117
## 0.47140424
## 0.46760740
## 0.18973125
## 0.38175534
## 0.45623593
## 0.47034148
## 0.47194481
## 0.46867213
## 0.18571495
## 0.37819253
## 0.45303490
## 0.46951103
## 0.47137584
## 0.46752728
## 0.31667335
## 0.38431114
## 0.40855693
## 0.41825851
## 0.42669211
## 0.42903329
## 0.31258808
## 0.38136717
## 0.40627719
## 0.42026914
## 0.42758466
## 0.42684539
## 0.31990239
## 0.38729044
## 0.40874817
## 0.41875263
## 0.42538371
## 0.43090132
## 0.42062776
## 0.43726032
## 0.44544334
## 0.44900952
## 0.44070864
## 0.43705075
## 0.41968157
## 0.43907408
## 0.44614069
## 0.44614137
## 0.44286235
## 0.43510905
## 0.42667586
## 0.43701145
## 0.44596172
## 0.45328366
## 0.44880505
## 0.44307079
## 0.44215422
## 0.44939603
## 0.44175308
## 0.44406976
## 0.43460273
## 0.43098439
## 0.44498464
## 0.44639614
## 0.45068842
## 0.44997511
## 0.43465401
## 0.43020531
## 0.43897011
## 0.44408071
## 0.44870868
## 0.44822852
## 0.43872644
## 0.43528820
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 1000,
## interaction.depth = 10, shrinkage = 0.01 and n.minobsinnode = 10.
## shrinkage interaction.depth n.minobsinnode n.trees Accuracy
## "0.01" " 1" " 3" " 10" "0.8829979"
## "0.01" " 1" " 3" " 20" "0.8829979"
## "0.01" " 1" " 3" " 50" "0.8829979"
## "0.01" " 1" " 3" " 100" "0.8848242"
## "0.01" " 1" " 3" " 500" "0.8934753"
## "0.01" " 1" " 3" "1000" "0.8984422"
## Kappa
## "0.00000000"
## "0.00000000"
## "0.00000000"
## "0.03330150"
## "0.20288362"
## "0.32936482"
model_gbm$bestTune%>%kable()
 | n.trees | interaction.depth | shrinkage | n.minobsinnode |
---|---|---|---|---|
54 | 1000 | 10 | 0.01 | 10 |
model_gbm$results%>%head()%>%kable()
 | shrinkage | interaction.depth | n.minobsinnode | n.trees | Accuracy | Kappa | AccuracySD | KappaSD |
---|---|---|---|---|---|---|---|---|
1 | 0.01 | 1 | 3 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
7 | 0.01 | 1 | 5 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
13 | 0.01 | 1 | 10 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
55 | 0.05 | 1 | 3 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
61 | 0.05 | 1 | 5 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
67 | 0.05 | 1 | 10 | 10 | 0.8829979 | 0 | 7.02e-05 | 0 |
#various ways of finding the row with maximum accuracy
model_gbm$results[which.max(model_gbm$results$Accuracy),]
model_gbm$results%>%filter()%>%dplyr::summarise(max1=max(Accuracy))
model_gbm$results %>% dplyr::slice(which.min(Accuracy ))
model_gbm$results%>%dplyr::slice(which.max(Accuracy ))
model_gbm$results[ which(model_gbm$results$Accuracy ==max(model_gbm$results$Accuracy)) ,]
plot(model_gbm)
#
# pred<-predict(model_gbm,iris_test)
#
# Conf_matrix<-confusionMatrix(pred,iris[1:5,5])
#
# kable(Conf_matrix$table)
#trellis.par.set(caretTheme())
#plot(model_nnet,metric = "kappa",plotType = "level",scales=list(x=list(rot=90)))
#other model plot
trellis.par.set(caretTheme())
plot(model_gbm,plotType = "level",scales=list(x=list(rot=90)))
#using tune length
#fitControl <- trainControl(method = "repeatedcv",number = 5,repeats = 5,
# allowParallel = TRUE )
# Set up training control
fitControl <- trainControl(method = "repeatedcv", # 10fold cross validation
number = 5, # do 5 repititions of cv
summaryFunction=twoClassSummary, # Use AUC to pick the best model
classProbs=TRUE,
allowParallel = TRUE)
#This fixes problem with names when classProbs=TRUE
trainSetglm<-if_else(trainSet[,outcomeName]==0,"No","Yes")
#model_gbm2<-train(trainSet[,Topfivepred],trainSetglm,method='gbm',
# preProcess = c("center","scale"), trControl=fitControl,metric = "ROC")
We can use tuneLength instead of specifying a grid of values for each parameter; caret then evaluates tuneLength candidate values for each tuning parameter automatically.
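For illustration, a call of the following form would let caret build the tuning grid itself (shown commented out like the other slow training calls; model_gbm_tl is a hypothetical name and 5 is an arbitrary choice):
# model_gbm_tl<-train(trainSet[,Topfivepred],trainSetglm,method='gbm',
#                     trControl=fitControl,metric = "ROC",
#                     tuneLength = 5)  # caret evaluates 5 candidate values per tuning parameter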
# model_gbm3<-train(trainSet[,Topfivepred],trainSet[,outcomeName],method='gbm',
# trControl=fitControl,interaction.depth=10,n.trees=100,n.minobsinnode=10)
# print(model_gbm)
#save(model_gbm2,file="model_gbm2.RData")
load("model_gbm2.RData")
print(model_gbm2)
## Stochastic Gradient Boosting
##
## 31649 samples
## 5 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 25320, 25319, 25320, 25318, 25319
## Resampling results across tuning parameters:
##
## interaction.depth n.trees ROC Sens Spec
## 1 50 0.8769229 0.9891934 0.1909235
## 1 100 0.8839802 0.9765976 0.3197414
## 1 150 0.8861147 0.9744507 0.3418784
## 2 50 0.8844877 0.9733413 0.3440376
## 2 100 0.8875928 0.9709081 0.3734716
## 2 150 0.8881013 0.9708724 0.3729350
## 3 50 0.8868124 0.9721248 0.3618594
## 3 100 0.8882853 0.9716596 0.3651027
## 3 150 0.8884538 0.9723394 0.3578156
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
#plot the model metrics
trellis.par.set(caretTheme())
plot(model_gbm2)
#varImp(object=model_gbm2)
# Predict using the test data
pred<-predict(model_gbm2,testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Stochastic Gradient Boosting Machine')
# Print, plot variable importance
#print(varImp(model_gbm2, scale = FALSE))
#plot(varImp(model_gbm2, scale = FALSE), main="Stochastic Gradient Boosting")
#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")
confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 11971 16
## Yes 1565 11
##
## Accuracy : 0.8834
## 95% CI : (0.8779, 0.8888)
## No Information Rate : 0.998
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0098
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.88438
## Specificity : 0.40741
## Pos Pred Value : 0.99867
## Neg Pred Value : 0.00698
## Prevalence : 0.99801
## Detection Rate : 0.88262
## Detection Prevalence : 0.88380
## Balanced Accuracy : 0.64589
##
## 'Positive' Class : No
##
#other model plot
trellis.par.set(caretTheme())
#plot(model_gbm2,metric = "kappa",plotType = "level",scales=list(x=list(rot=90)))
ggplot(model_gbm2)
## Warning: Ignoring unknown aesthetics: shape
#other model plot
trellis.par.set(caretTheme())
plot(model_gbm2,plotType = "level",scales=list(x=list(rot=90)))
The maximum accuracy of 0.9059 occurs at the parameter combination shrinkage = 0.01, interaction.depth = 10, n.minobsinnode = 10 and n.trees = 1000. The new model will be fitted with these parameter values.
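A minimal sketch of such a refit, reusing model_gbm$bestTune as a one-row tuneGrid (model_gbm_final is a hypothetical name; shown commented out because it is slow to run):
# model_gbm_final<-train(trainSet[,Topfivepred],trainSetglm,method='gbm',
#                        trControl=fitControl,metric = "ROC",
#                        tuneGrid = model_gbm$bestTune)  # n.trees=1000, interaction.depth=10, shrinkage=0.01, n.minobsinnode=10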
Neural networks
#This fixes problem with names when classProbs=TRUE
trainSetglm<-if_else(trainSet[,outcomeName]==0,"No","Yes")
#model_nnet<-train(trainSet[,Topfivepred],trainSetglm,method="nnet",trControl=fitControl,
# preProcess = c("center","scale"),metric = "ROC")
#save(model_nnet,file="model_nnet.RData")
load("model_nnet.RData")
print(model_nnet)
## Neural Network
##
## 31649 samples
## 5 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 25319, 25319, 25320, 25320, 25318
## Resampling results across tuning parameters:
##
## size decay ROC Sens Spec
## 1 0e+00 0.8851818 0.9647894 0.3918467
## 1 1e-04 0.8855445 0.9655766 0.3915764
## 1 1e-01 0.8864474 0.9663996 0.3980563
## 3 0e+00 0.8881036 0.9695129 0.3707714
## 3 1e-04 0.8890963 0.9716239 0.3586264
## 3 1e-01 0.8896989 0.9737709 0.3535069
## 5 0e+00 0.8888763 0.9731625 0.3591753
## 5 1e-04 0.8899594 0.9738068 0.3586202
## 5 1e-01 0.8901357 0.9744151 0.3540398
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
ggplot(model_nnet)
## Warning: Ignoring unknown aesthetics: shape
#other model plot
trellis.par.set(caretTheme())
plot(model_nnet,plotType = "level",scales=list(x=list(rot=90)))
varImp(object=model_nnet)
## nnet variable importance
##
## Overall
## duration 100.000
## poutcomesuccess 35.001
## contactunknown 34.089
## housingyes 1.523
## monthmar 0.000
# Predict using the test data
pred<-predict(model_nnet,testSet)
my_data=data.frame(cbind(predicted=pred,observed=testSet$y))
ggplot(my_data,aes(predicted,observed))+geom_point()+geom_smooth(method=lm)+ggtitle('Neural Network')
# Print, plot variable importance
print(varImp(model_nnet, scale = FALSE))
## nnet variable importance
##
## Overall
## duration 41.478
## poutcomesuccess 20.286
## contactunknown 19.989
## housingyes 9.371
## monthmar 8.875
plot(varImp(model_nnet, scale = FALSE), main="Neural Network")
#This fixes problem with names when classProbs=TRUE
testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")
confusionMatrix(testSetglm,pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 11694 293
## Yes 1018 558
##
## Accuracy : 0.9033
## 95% CI : (0.8982, 0.9083)
## No Information Rate : 0.9373
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4119
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9199
## Specificity : 0.6557
## Pos Pred Value : 0.9756
## Neg Pred Value : 0.3541
## Prevalence : 0.9373
## Detection Rate : 0.8622
## Detection Prevalence : 0.8838
## Balanced Accuracy : 0.7878
##
## 'Positive' Class : No
##
We choose between the models based on metrics such as the ROC (area under the receiver operating characteristic curve), which is very popular for binary classification problems, together with Sensitivity and Specificity. Accuracy is another alternative for picking the best model, although for binary classification problems the ROC is usually a better criterion.
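As a quick hold-out check, the same summary function used in fitControl can be applied to the test-set predictions. A minimal sketch, assuming the objects created above (eval_nnet is a hypothetical name):
eval_nnet<-data.frame(obs=factor(testSetglm,levels=c("No","Yes")),
                      pred=predict(model_nnet,testSet),
                      predict(model_nnet,testSet,type="prob"))  # adds the "No" and "Yes" probability columns
caret::twoClassSummary(eval_nnet,lev=c("No","Yes"))  # returns ROC, Sens and Spec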
# summary of hold-out accuracies for the models fitted above
tibble(model=c("Stochastic Gradient Boosting Machine","Random Forest","Boosted Logistic Regression","Extreme Gradient Boosting Machine","Logistic Regression","Neural Networks"),
       accuracy=c(0.8832,0.9174,0.8416,0.8832,0.901,0.9038))
######################################################################################
# CHOOSING BETWEEN MODELS
######################################################################################
resamps <- resamples(list(GBM = model_gbm2,
#SVM = model_svm,
RF=re[[3]],
BLR=re[[1]],
EGBM=re[[2]],
GLM=model_glm,
NNET = model_nnet))
resamps
##
## Call:
## resamples.default(x = list(GBM = model_gbm2, RF = re[[3]], BLR =
## re[[1]], EGBM = re[[2]], GLM = model_glm, NNET = model_nnet))
##
## Models: GBM, RF, BLR, EGBM, GLM, NNET
## Number of resamples: 5
## Performance metrics: ROC, Sens, Spec
## Time estimates for: everything, final model fit
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: GBM, RF, BLR, EGBM, GLM, NNET
## Number of resamples: 5
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GBM 0.8816546 0.8833298 0.8903776 0.8884538 0.8929705 0.8939366 0
## RF 0.8035906 0.8177582 0.8195248 0.8175060 0.8210548 0.8256018 0
## BLR 0.8319502 0.8358015 0.8450984 0.8439240 0.8510384 0.8557315 0
## EGBM 0.8652316 0.8754254 0.8759783 0.8788672 0.8844292 0.8932713 0
## GLM 0.8796846 0.8821885 0.8827486 0.8838859 0.8850330 0.8897746 0
## NNET 0.8793133 0.8898545 0.8919580 0.8901357 0.8943596 0.8951932 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GBM 0.9699410 0.9699410 0.9728037 0.9723394 0.9744140 0.9745975 0
## RF 0.9559850 0.9572374 0.9586688 0.9582409 0.9590340 0.9602791 0
## BLR 0.9296833 0.9593845 0.9649311 0.9619261 0.9762075 0.9794239 0
## EGBM 0.9728037 0.9749508 0.9762033 0.9762040 0.9767400 0.9803220 0
## GLM 0.9735194 0.9736983 0.9745975 0.9762757 0.9792449 0.9803185 0
## NNET 0.9704777 0.9711986 0.9749508 0.9744151 0.9769189 0.9785293 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GBM 0.3454791 0.3481781 0.3527027 0.3578156 0.3540541 0.3886640 0
## RF 0.3198381 0.3216216 0.3229730 0.3291888 0.3292848 0.3522267 0
## BLR 0.2780027 0.3009447 0.3373819 0.3665091 0.3972973 0.5189189 0
## EGBM 0.3090418 0.3162162 0.3270270 0.3324273 0.3360324 0.3738192 0
## GLM 0.2995951 0.3067568 0.3162162 0.3229752 0.3360324 0.3562753 0
## NNET 0.3306343 0.3576248 0.3581081 0.3540398 0.3594595 0.3643725 0
theme1 <- trellis.par.get()
theme1$plot.symbol$col = rgb(.2, .2, .2, .4)
theme1$plot.symbol$pch = 16
theme1$plot.line$col = rgb(1, 0, 0, .7)
theme1$plot.line$lwd <- 2
trellis.par.set(theme1)
bwplot(resamps, layout = c(3, 1))
The neural network, with size = 5 hidden units, provides the highest ROC (area under the receiver operating characteristic curve); the closer this value is to 1, the better the model. It also has one of the highest accuracies among the models trained.
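The same summaries can also be pulled out programmatically; a small sketch, assuming summary(resamps)$statistics keeps its usual Min./Mean/Max. columns:
summary(resamps)$statistics$ROC[,"Mean"]  # mean cross-validated ROC per model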
trellis.par.set(caretTheme())
dotplot(resamps, metric = "ROC")
trellis.par.set(theme1)
xyplot(resamps, what = "BlandAltman")
splom(resamps)
######################################################################################
# CHOOSING BETWEEN MODELS
######################################################################################
difValues <- diff(resamps)
difValues
##
## Call:
## diff.resamples(x = resamps)
##
## Models: GBM, RF, BLR, EGBM, GLM, NNET
## Metrics: ROC, Sens, Spec
## Number of differences: 15
## p-value adjustment: bonferroni
summary(difValues)
##
## Call:
## summary.diff.resamples(object = difValues)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## ROC
## GBM RF BLR EGBM GLM NNET
## GBM 0.070948 0.044530 0.009587 0.004568 -0.001682
## RF 0.0022120 -0.026418 -0.061361 -0.066380 -0.072630
## BLR 0.0135805 0.1139522 -0.034943 -0.039962 -0.046212
## EGBM 1.0000000 0.0079713 0.2775045 -0.005019 -0.011269
## GLM 1.0000000 0.0003999 0.0294592 1.0000000 -0.006250
## NNET 1.0000000 1.48e-05 0.0131094 1.0000000 1.0000000
##
## Sens
## GBM RF BLR EGBM GLM NNET
## GBM 1.410e-02 1.041e-02 -3.865e-03 -3.936e-03 -2.076e-03
## RF 0.01337 -3.685e-03 -1.796e-02 -1.803e-02 -1.617e-02
## BLR 1.00000 1.00000 -1.428e-02 -1.435e-02 -1.249e-02
## EGBM 0.76082 0.00154 1.00000 -7.177e-05 1.789e-03
## GLM 1.00000 0.01147 1.00000 1.00000 1.861e-03
## NNET 1.00000 0.03067 1.00000 1.00000 1.00000
##
## Spec
## GBM RF BLR EGBM GLM NNET
## GBM 0.028627 -0.008694 0.025388 0.034840 0.003776
## RF 1.00000 -0.037320 -0.003239 0.006214 -0.024851
## BLR 1.00000 1.00000 0.034082 0.043534 0.012469
## EGBM 1.00000 1.00000 1.00000 0.009452 -0.021613
## GLM 0.07509 1.00000 1.00000 1.00000 -0.031065
## NNET 1.00000 0.65561 1.00000 1.00000 1.00000
trellis.par.set(theme1)
bwplot(difValues, layout = c(3, 1))
trellis.par.set(caretTheme())
dotplot(difValues)
gbm.probs<-predict(model_gbm2,testSet,type="prob")
#svm.probs<-predict.train(object=model_svm,testSet[,Topfivepred],type="prob")
blr.probs<-predict(re[[1]],testSet,type="prob")
rf.probs<-predict(re[[3]],testSet,type="prob")
egbm.probs<-predict(re[[2]],testSet,type="prob")
glm.probs<-predict(model_glm,testSet,type="prob")
nnet.probs<-predict(model_nnet,testSet,type="prob")
#svm.probs returning NA
#svm.probs%>%head()
blr.probs%>%head()
#This fixes problem with names when classProbs=TRUE
#testSetglm<-if_else(testSet[,outcomeName]==0,"No","Yes")
#Data$y<-ifelse(Data$y=='no',0,1)
results=data.frame(label=testSetglm)
results$gbm<-gbm.probs[,"Yes"]
#results$svm<-svm.probs[,"Yes"]
results$blr<-blr.probs[,"Yes"]
results$rf<-rf.probs[,"Yes"]
results$egbm<-egbm.probs[,"Yes"]
results$glm<-glm.probs[,"Yes"]
results$nnet<-nnet.probs[,"Yes"]
results%>%head()
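From these hold-out probabilities a test-set AUC per model can also be computed with pROC; a rough sketch using the column names created above:
sapply(c("gbm","blr","rf","egbm","glm","nnet"),function(m)
  as.numeric(pROC::auc(results$label,results[[m]],levels=c("No","Yes"),direction="<")))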
The lift function can be used to evaluate probability thresholds that capture a certain percentage of hits. The function requires a set of sample probability predictions (not from the training set) and the true class labels. The lift function does the calculations, and the corresponding plot function is used to plot the lift curve (although some call this the gain curve).
trellis.par.set(caretTheme())
#lift_obj <- caret::lift(label ~ gbm + blr+rf+egbm+glm+nnet, data = results)
#save(lift_obj,file="lift_obj.RData")
load("lift_obj.RData")
ggplot(lift_obj, values = 60)
From this we can see that, to find 60 percent of the hits, approximately 62 percent of the data can be sampled (when ordered by the probability predictions). The boosted logistic regression model does somewhat worse than the other five models.
plot(lift_obj, values = 60, auto.key = list(columns = 3,
lines = TRUE,
points = FALSE))
Calibration curves can be used to characterize how consistent the predicted class probabilities are with the observed event rates.
trellis.par.set(caretTheme())
cal_obj <- caret::calibration(label ~ gbm + blr+rf+egbm+glm+nnet, data = results,
cuts = 13)
plot(cal_obj, type = "l", auto.key = list(columns = 3,
lines = TRUE,
points = FALSE))
ggplot(cal_obj)
Using the lime package to visualize predictors
The lime package fits simple local surrogate models, weighted linear combinations of the predictors, that help explain individual predictions of the model built.
# Create an explainer object from the training predictors and the fitted caret model
#explainer <- lime(trainSet[,Topfivepred], model_nnet)
# Explain a few new observations (explaining the full test set would be slow)
#explanation <- explain(testSet[1:5,Topfivepred], explainer, n_labels = 1, n_features = 5)
# The output is provided in a consistent tabular format, one row per case and feature,
# and includes the output from the model.
#head(explanation)
# The feature weights can be visualised directly
#plot_features(explanation)
# Stop the parallel cluster created at the start of the script
stopCluster(cl)