Execute by Neha Raut
A Portuguese bank is rolling out a term deposit product for its customers. In the past, the bank has reached its customer base through phone calls. Results of these previous campaigns were recorded and have been provided to the current campaign manager to help make this campaign more effective.
The challenges the manager faces are the following:
Customers have recently started to complain that the bank’s marketing staff bothers them with irrelevant product calls, and that this should stop immediately.
There is no prior framework to help her decide which customers to call and which to leave alone.
She has decided to use past data to automate this decision instead of manually going through each customer. The previous campaign data made available to her contains customer characteristics, campaign characteristics, previous campaign information, and whether the customer ended up subscribing to the product as a result of that campaign. Using this, she plans to develop a statistical model that, given this information, predicts whether the customer in question will subscribe to the product. A successful model will make her campaign efficiently targeted and less bothersome to uninterested customers.
The goal is to build a machine learning model that predicts which customers the bank should target when rolling out term deposits.
Evaluation criterion: KS score on test data. The larger the KS, the better the model.
We have given you two datasets, bank-full_train.csv and bank-full_test.csv. You need to use bank-full_train to build a predictive model for the response variable “y”. bank-full_test contains all other factors except “y”; you need to predict it using the model you developed and submit your predicted values in a CSV file.
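The KS score used as the evaluation criterion is the largest gap between the cumulative true-positive rate and the cumulative false-positive rate when customers are ranked by predicted score. The following is a minimal base-R sketch on made-up data; `ks_stat`, `toy_score` and `toy_label` are illustrative names and are not part of the material provided with the project.

```r
# Hypothetical helper (not from the original material): KS statistic for a
# binary classifier, i.e. the maximum gap between TPR and FPR over all cutoffs.
ks_stat <- function(score, label) {
  stopifnot(length(score) == length(label), all(label %in% c(0, 1)))
  ord <- order(score, decreasing = TRUE)
  s <- score[ord]
  l <- label[ord]
  tpr <- cumsum(l == 1) / sum(l == 1)   # share of responders captured so far
  fpr <- cumsum(l == 0) / sum(l == 0)   # share of non-responders captured so far
  # evaluate only at the last row of each distinct score so ties share one cutoff
  keep <- !duplicated(s, fromLast = TRUE)
  max(abs(tpr[keep] - fpr[keep]))
}

# toy data, made up for illustration
toy_score <- c(0.9, 0.8, 0.7, 0.6)
toy_label <- c(1, 0, 1, 0)
ks_toy <- ks_stat(toy_score, toy_label)
print(ks_toy)
```

A KS of 1 means the scores separate responders from non-responders perfectly, while a KS near 0 means the ranking is no better than random.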
Variables: definition, type and categories
Each row represents the characteristics of a single customer. Many categorical values have been coded to mask the data; you don’t need to worry about their exact meaning.
1 - age (numeric)
2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”, “self-employed”,“retired”,“technician”, “services”)
3 - marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
4 - education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
5 - default: has credit in default? (binary: “yes”,“no”)
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: “yes”,“no”)
8 - loan: has personal loan? (binary: “yes”,“no”)
Related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
10 - day: last contact day of the month (numeric)
Direct Marketing Campaign: Details and Phase I Tasks
11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, . . . , “nov”, “dec”)
12 - duration: last contact duration, in seconds (numeric)
Other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
Output variable (desired target):
17 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)
Loading the required libraries: dplyr, ggplot2 and ROCR.
library(dplyr)
library(ggplot2)
library(ROCR)
Read train and test datasets:
train=read.csv("bank-full_train.csv",stringsAsFactors = FALSE, header=T)
test=read.csv("bank-full_test.csv",stringsAsFactors = FALSE, header=T)
Combining the train and test datasets prior to data preparation.
Before combining, however, we need a placeholder column to differentiate observations coming from train and test. We also need to add a response column to the test data so that both datasets have the same columns; we fill test’s response column with NAs.
#Combine both train and test data
test$y=NA
train$data='train'
test$data='test'
all_data=rbind(train,test)
glimpse(all_data)
## Observations: 45,211
## Variables: 19
## $ age <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37, ...
## $ job <chr> "blue-collar", "admin.", "technician", "self-employe...
## $ marital <chr> "married", "divorced", "divorced", "married", "marri...
## $ education <chr> "secondary", "secondary", "secondary", "tertiary", "...
## $ default <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "no...
## $ balance <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 387,...
## $ housing <chr> "no", "no", "no", "no", "yes", "yes", "yes", "no", "...
## $ loan <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "no...
## $ contact <chr> "cellular", "cellular", "cellular", "cellular", "unk...
## $ day <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 21,...
## $ month <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "ju...
## $ duration <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 158...
## $ campaign <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1, ...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, -1...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0...
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown...
## $ ID <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 1231...
## $ y <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", "n...
## $ data <chr> "train", "train", "train", "train", "train", "train"...
apply(all_data,2,function(x)sum(is.na(x)))
##       age       job   marital education   default   balance   housing
##         0         0         0         0         0         0         0
##      loan   contact       day     month  duration  campaign     pdays
##         0         0         0         0         0         0         0
##  previous  poutcome        ID         y      data
##         0         0         0     13564         0
glimpse(all_data)
## Observations: 45,211
## Variables: 19
## $ age <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37, ...
## $ job <chr> "blue-collar", "admin.", "technician", "self-employe...
## $ marital <chr> "married", "divorced", "divorced", "married", "marri...
## $ education <chr> "secondary", "secondary", "secondary", "tertiary", "...
## $ default <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "no...
## $ balance <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 387,...
## $ housing <chr> "no", "no", "no", "no", "yes", "yes", "yes", "no", "...
## $ loan <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "no...
## $ contact <chr> "cellular", "cellular", "cellular", "cellular", "unk...
## $ day <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 21,...
## $ month <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "ju...
## $ duration <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 158...
## $ campaign <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1, ...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, -1...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0...
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown...
## $ ID <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 1231...
## $ y <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", "n...
## $ data <chr> "train", "train", "train", "train", "train", "train"...
Finding the number of distinct categories in each character variable (y is excluded because it is the target variable).
for(i in 1:ncol(all_data)){
if(class(all_data[,i])=="character"){
if(names(all_data)[i]!="y"){
message=paste("Number of categories in ",names(all_data)[i]," : ")
num.cat=length(unique(all_data[,i]))
print(paste0(message,num.cat))
}
}
}
## [1] "Number of categories in job : 12"
## [1] "Number of categories in marital : 3"
## [1] "Number of categories in education : 4"
## [1] "Number of categories in default : 2"
## [1] "Number of categories in housing : 2"
## [1] "Number of categories in loan : 2"
## [1] "Number of categories in contact : 3"
## [1] "Number of categories in month : 12"
## [1] "Number of categories in poutcome : 4"
## [1] "Number of categories in data : 2"
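The same count can be obtained without an explicit loop. The sketch below uses a small made-up data frame (`toy`) in place of `all_data`; the names `is_chr`, `chr_cols` and `n_cat` are introduced here only for illustration.

```r
# toy data frame standing in for all_data (values are made up)
toy <- data.frame(
  job = c("admin.", "student", "admin.", "retired"),
  marital = c("single", "married", "single", "single"),
  age = c(30L, 22L, 41L, 67L),
  y = c("no", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# keep character columns, excluding the target
is_chr <- sapply(toy, is.character)
chr_cols <- setdiff(names(toy)[is_chr], "y")

# number of distinct values per character column, as a named vector
n_cat <- sapply(toy[chr_cols], function(x) length(unique(x)))
print(n_cat)
```

Returning a named vector rather than printing strings makes the counts easy to reuse, for example to pick out columns with many categories.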
Creating dummy variables for job (character type) by combining categories with similar response rates.
t=table(all_data$job)
sort(t)
##
## unknown student housemaid unemployed entrepreneur
## 288 938 1240 1303 1487
## self-employed retired services admin. technician
## 1579 2264 4154 5171 7597
## management blue-collar
## 9458 9732
final=round(prop.table(table(all_data$job,all_data$y),1)*100,1)
final
##
## no yes
## admin. 87.9 12.1
## blue-collar 92.8 7.2
## entrepreneur 90.7 9.3
## housemaid 92.4 7.6
## management 86.3 13.7
## retired 76.8 23.2
## self-employed 87.9 12.1
## services 91.0 9.0
## student 70.8 29.2
## technician 88.6 11.4
## unemployed 85.1 14.9
## unknown 88.2 11.8
#Add margins
s=addmargins(final,2) #margin=2 appends a Sum column, i.e. the total across the y categories for each job
sort(s[,1])
## student retired unemployed management admin.
## 70.8 76.8 85.1 86.3 87.9
## self-employed unknown technician entrepreneur services
## 87.9 88.2 88.6 90.7 91.0
## housemaid blue-collar
## 92.4 92.8
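#Create dummies for the grouped job categories; blue-collar, the largest category, is left out as the baseline
#Caution: the level is spelled "admin." in the data, so "admin" below matches no rows and admin. customers fall into the baseline as well; the outputs that follow reflect this
all_data=all_data %>%
mutate(job_1=as.numeric(job %in% c("self-employed","unknown","technician")),
job_2=as.numeric(job %in% c("services","housemaid","entrepreneur")),
job_3=as.numeric(job %in% c("management","admin")),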
job_4=as.numeric(job=="student"),
job_5=as.numeric(job=="retired"),
job_6=as.numeric(job=="unemployed")) %>%
select(-job)
glimpse(all_data)
## Observations: 45,211
## Variables: 24
## $ age <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37, ...
## $ marital <chr> "married", "divorced", "divorced", "married", "marri...
## $ education <chr> "secondary", "secondary", "secondary", "tertiary", "...
## $ default <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "no...
## $ balance <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 387,...
## $ housing <chr> "no", "no", "no", "no", "yes", "yes", "yes", "no", "...
## $ loan <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "no...
## $ contact <chr> "cellular", "cellular", "cellular", "cellular", "unk...
## $ day <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 21,...
## $ month <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "ju...
## $ duration <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 158...
## $ campaign <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1, ...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, -1...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0...
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown...
## $ ID <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 1231...
## $ y <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", "n...
## $ data <chr> "train", "train", "train", "train", "train", "train"...
## $ job_1 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1...
## $ job_2 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0...
## $ job_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0...
## $ job_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ job_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ job_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
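Because the groups are typed by hand, a misspelled level silently drops out of its dummy and ends up in the baseline. One way to guard against this, sketched here on made-up data with hypothetical names (`toy_job`, `groups`, `unmatched`, `baseline`), is to compare the labels used in the grouping with the levels actually present in the column.

```r
# toy job column (made-up values); note the trailing period in "admin."
toy_job <- c("admin.", "management", "student", "blue-collar", "admin.")

# hand-typed grouping, written in the same style as the dummies above
groups <- list(
  job_3 = c("management", "admin"),   # "admin" is missing the period
  job_4 = "student"
)

# labels used in the grouping that do not occur in the data
used <- unlist(groups, use.names = FALSE)
unmatched <- setdiff(used, unique(toy_job))
print(unmatched)

# levels in the data not covered by any dummy (these form the baseline)
baseline <- setdiff(unique(toy_job), used)
print(baseline)
```

An empty `unmatched` vector and a `baseline` containing only the intended reference level indicate that the grouping matches the data.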
#Marital
t=table(all_data$marital)
sort(t)
##
## divorced single married
## 5207 12790 27214
all_data=all_data %>%
mutate(divorced=as.numeric(marital %in% c("divorced")),
single=as.numeric(marital %in% c("single"))
) %>%
select(-marital)
glimpse(all_data)
## Observations: 45,211
## Variables: 25
## $ age <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37, ...
## $ education <chr> "secondary", "secondary", "secondary", "tertiary", "...
## $ default <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "no...
## $ balance <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 387,...
## $ housing <chr> "no", "no", "no", "no", "yes", "yes", "yes", "no", "...
## $ loan <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "no...
## $ contact <chr> "cellular", "cellular", "cellular", "cellular", "unk...
## $ day <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 21,...
## $ month <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "ju...
## $ duration <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 158...
## $ campaign <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1, ...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, -1...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0...
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown...
## $ ID <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 1231...
## $ y <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", "n...
## $ data <chr> "train", "train", "train", "train", "train", "train"...
## $ job_1 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1...
## $ job_2 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0...
## $ job_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0...
## $ job_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ job_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ job_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ divorced <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ single <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0...
#Education
t=table(all_data$education)
sort(t)
##
## unknown primary tertiary secondary
## 1857 6851 13301 23202
all_data=all_data %>%
mutate(edu_primary=as.numeric(education %in% c("primary")),
edu_sec=as.numeric(education %in% c("secondary")),
edu_tert=as.numeric(education %in% c("tertiary"))
) %>%
select(-education)
glimpse(all_data)
## Observations: 45,211
## Variables: 27
## $ age <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "...
## $ balance <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing <chr> "no", "no", "no", "no", "yes", "yes", "yes", "no",...
## $ loan <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "...
## $ contact <chr> "cellular", "cellular", "cellular", "cellular", "u...
## $ day <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ month <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "...
## $ duration <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unkno...
## $ ID <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...
#for variable default
table(all_data$default)
##
## no yes
## 44396 815
all_data$default=as.numeric(all_data$default=="yes")
#Housing
table(all_data$housing)
##
## no yes
## 20081 25130
all_data$housing=as.numeric(all_data$housing=="yes")
glimpse(all_data)
## Observations: 45,211
## Variables: 27
## $ age <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ balance <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,...
## $ loan <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "...
## $ contact <chr> "cellular", "cellular", "cellular", "cellular", "u...
## $ day <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ month <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "...
## $ duration <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unkno...
## $ ID <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...
#Loan
table(all_data$loan)
##
## no yes
## 37967 7244
all_data$loan=as.numeric(all_data$loan=="yes")
glimpse(all_data)
## Observations: 45,211
## Variables: 27
## $ age <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ balance <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,...
## $ loan <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ contact <chr> "cellular", "cellular", "cellular", "cellular", "u...
## $ day <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ month <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "...
## $ duration <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unkno...
## $ ID <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...
#Contact
t=table(all_data$contact)
sort(t)
##
## telephone unknown cellular
## 2906 13020 29285
all_data=all_data %>%
mutate(co_cellular=as.numeric(contact %in% c("cellular")),
co_tel=as.numeric(contact %in% c("telephone"))
) %>%
select(-contact)
glimpse(all_data)
## Observations: 45,211
## Variables: 28
## $ age <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ balance <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,...
## $ loan <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ day <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ month <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "...
## $ duration <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unkno...
## $ ID <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ co_cellular <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,...
## $ co_tel <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
#Month
table(all_data$month)
##
## apr aug dec feb jan jul jun mar may nov oct sep
## 2932 6247 214 2649 1403 6895 5341 477 13766 3970 738 579
finalmnth=round(prop.table(table(all_data$month,all_data$y),1)*100,1)
sss=addmargins(finalmnth,2) #adding margin across Y
sort(sss[,1])
## mar oct sep dec apr feb aug jan nov jun jul may
## 46.7 55.0 57.8 58.0 80.1 83.1 88.7 89.1 89.3 90.1 91.0 93.2
#Ignore may (it serves as the baseline)
all_data=all_data %>%
mutate(month_1=as.numeric(month %in% c("aug","jan","jun","nov","jul")),
month_2=as.numeric(month %in% c("dec","sep")),
month_3=as.numeric(month=="mar"),
month_4=as.numeric(month=="oct"),
month_5=as.numeric(month=="apr"),
month_6=as.numeric(month=="feb")) %>%
select(-month)
glimpse(all_data)
## Observations: 45,211
## Variables: 33
## $ age <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ balance <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,...
## $ loan <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ day <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ duration <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unkno...
## $ ID <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ co_cellular <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,...
## $ co_tel <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ month_1 <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,...
## $ month_2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_3 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
#Outcome
t=table(all_data$poutcome)
sort(t)
##
## success other failure unknown
## 1511 1840 4901 36959
all_data=all_data %>%
mutate(poc_success=as.numeric(poutcome %in% c("success")),
poc_failure=as.numeric(poutcome %in% c("failure")),
poc_other=as.numeric(poutcome %in% c("other"))
)%>%
select(-poutcome)
glimpse(all_data)
## Observations: 45,211
## Variables: 35
## $ age <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ balance <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,...
## $ loan <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ day <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ duration <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ ID <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2 <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ co_cellular <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,...
## $ co_tel <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ month_1 <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,...
## $ month_2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_3 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_6 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ poc_success <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ poc_failure <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ poc_other <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,...
Preparation of the predictors is done. Now we need to convert the response variable from yes/no to 1/0.
all_data$y=as.numeric(all_data$y=="yes")
table(all_data$y)
##
## 0 1
## 27927 3720
#Next we take care of missing values if any in the data.
all_data=all_data[!((is.na(all_data$y)) & all_data$data=='train'), ]
for(col in names(all_data)){
if(sum(is.na(all_data[,col]))>0 & !(col %in% c("data","y"))){
all_data[is.na(all_data[,col]),col]=mean(all_data[all_data$data=='train',col],na.rm=T)
}
}
sum(is.na(all_data$data=='train'))
## [1] 0
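The loop above fills missing values in both train and test rows with the mean computed from the train rows only, so no information from the test set enters the imputed value. A minimal sketch of the same idea on a made-up data frame (`toy` and `train_mean` are illustrative names):

```r
# toy data with missing balances in both a train row and a test row
toy <- data.frame(
  balance = c(100, NA, 300, NA),
  data = c("train", "train", "train", "test"),
  stringsAsFactors = FALSE
)

# mean taken over train rows only, ignoring NAs
train_mean <- mean(toy$balance[toy$data == "train"], na.rm = TRUE)

# the same train-based value is used for every missing entry
toy$balance[is.na(toy$balance)] <- train_mean
print(toy)
```

In the actual data the NA counts printed earlier show that only y has missing values, so the loop leaves the predictors unchanged.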
Data preparation is now done, so we separate the train and test data again.
train=all_data %>%
filter(data=='train') %>%
select(-data) #31647,34
test=all_data %>%
filter(data=='test') %>%
select(-data,-y)
We will split train into train_75, used to build the logistic regression model, and test_25, used to test the performance of the model thus built.
set.seed(5)
s=sample(1:nrow(train),0.75*nrow(train))
train_75=train[s,] #23735,34
test_25=train[-s,]#7912,34
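After a random split like this, it can be worth confirming that the two parts are disjoint, cover every row, and have broadly similar response rates. The sketch below does this on made-up data; `toy`, `part_a`, `part_b` and `rates` are illustrative names only.

```r
set.seed(5)
# toy data: 100 rows with a 25% response rate (made up)
toy <- data.frame(id = 1:100, y = rep(c(0, 0, 0, 1), 25))

s <- sample(1:nrow(toy), 0.75 * nrow(toy))
part_a <- toy[s, ]    # 75% used for fitting
part_b <- toy[-s, ]   # remaining 25% held out

# the two parts should be disjoint and together cover every row
stopifnot(nrow(part_a) + nrow(part_b) == nrow(toy))
stopifnot(length(intersect(part_a$id, part_b$id)) == 0)

# response rates in the two parts, for a quick visual comparison
rates <- c(part_a = mean(part_a$y), part_b = mean(part_b$y))
print(round(rates, 3))
```

If the rates differ a lot, a stratified split on y is one alternative to consider.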
#Find out vif >5
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
for_vif=lm(y~.,data=train_75)
summary(for_vif)
##
## Call:
## lm(formula = y ~ ., data = train_75)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.73998 -0.11938 -0.03431 0.03900 1.04734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.645e-01 1.502e-02 -10.955 < 2e-16 ***
## age 4.555e-04 2.130e-04 2.138 0.032499 *
## default 1.388e-03 1.301e-02 0.107 0.915086
## balance 1.064e-06 5.623e-07 1.893 0.058378 .
## housing -3.364e-02 4.129e-03 -8.147 3.92e-16 ***
## loan -1.477e-02 4.817e-03 -3.066 0.002170 **
## day 5.519e-04 2.235e-04 2.469 0.013550 *
## duration 4.940e-04 6.874e-06 71.874 < 2e-16 ***
## campaign -4.345e-04 5.931e-04 -0.733 0.463742
## pdays -6.411e-05 3.771e-05 -1.700 0.089095 .
## previous 5.690e-04 7.702e-04 0.739 0.460063
## ID 7.573e-06 2.548e-07 29.718 < 2e-16 ***
## job_1 5.941e-03 5.158e-03 1.152 0.249391
## job_2 -1.916e-03 5.401e-03 -0.355 0.722799
## job_3 1.312e-02 6.282e-03 2.089 0.036736 *
## job_4 3.814e-02 1.315e-02 2.900 0.003738 **
## job_5 3.533e-02 9.541e-03 3.703 0.000214 ***
## job_6 -5.377e-03 1.082e-02 -0.497 0.619383
## divorced 1.301e-02 5.634e-03 2.309 0.020963 *
## single 2.035e-02 4.461e-03 4.562 5.10e-06 ***
## edu_primary -5.535e-03 9.893e-03 -0.560 0.575810
## edu_sec 2.825e-03 9.135e-03 0.309 0.757090
## edu_tert 9.105e-03 9.789e-03 0.930 0.352326
## co_cellular -1.022e-01 6.779e-03 -15.073 < 2e-16 ***
## co_tel -1.131e-01 9.408e-03 -12.021 < 2e-16 ***
## month_1 2.873e-02 4.896e-03 5.869 4.43e-09 ***
## month_2 9.915e-02 1.438e-02 6.894 5.55e-12 ***
## month_3 3.242e-01 1.765e-02 18.372 < 2e-16 ***
## month_4 1.803e-01 1.449e-02 12.442 < 2e-16 ***
## month_5 4.452e-02 7.883e-03 5.648 1.64e-08 ***
## month_6 3.337e-02 8.732e-03 3.821 0.000133 ***
## poc_success 3.540e-01 1.219e-02 29.048 < 2e-16 ***
## poc_failure -2.388e-02 1.112e-02 -2.148 0.031739 *
## poc_other 2.657e-04 1.299e-02 0.020 0.983679
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2667 on 23701 degrees of freedom
## Multiple R-squared: 0.3258, Adjusted R-squared: 0.3248
## F-statistic: 347 on 33 and 23701 DF, p-value: < 2.2e-16
sort(vif(for_vif),decreasing = T)[1:3]
## edu_sec edu_tert pdays
## 6.952724 6.607517 4.769001
# edu_sec has the highest VIF, so drop it from the model
for_vif=lm(y~.-edu_sec,data=train_75)
sort(vif(for_vif),decreasing = T)[1:3]
## pdays poc_failure ID
## 4.768821 3.965875 3.701446
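The VIF values reported above can be reproduced by hand, which makes clear what the number means. The following is a minimal sketch on simulated data, not the bank data: the predictors x1, x2 and x3 are hypothetical, and x2 is deliberately built to be correlated with x1. For each predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the others.

```r
# Minimal sketch on simulated data; x1, x2, x3 are hypothetical predictors.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # deliberately correlated with x1
x3 <- rnorm(n)                  # independent of the other two
d  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the remaining predictors
manual_vif <- sapply(names(d), function(v) {
  others <- setdiff(names(d), v)
  r2 <- summary(lm(reformulate(others, response = v), data = d))$r.squared
  1 / (1 - r2)
})
print(round(manual_vif, 2))
```

A predictor that the others explain well gets a large VIF, which is why x1 and x2 should come out well above x3 here. For a linear model with only single-column terms, `vif()` from the car package is expected to report the same quantity.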
summary(for_vif)
##
## Call:
## lm(formula = y ~ . - edu_sec, data = train_75)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.73976 -0.11941 -0.03441 0.03904 1.04751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.617e-01 1.193e-02 -13.562 < 2e-16 ***
## age 4.491e-04 2.120e-04 2.118 0.034151 *
## default 1.367e-03 1.301e-02 0.105 0.916368
## balance 1.065e-06 5.623e-07 1.894 0.058287 .
## housing -3.361e-02 4.128e-03 -8.142 4.09e-16 ***
## loan -1.470e-02 4.812e-03 -3.055 0.002252 **
## day 5.518e-04 2.235e-04 2.469 0.013564 *
## duration 4.940e-04 6.874e-06 71.875 < 2e-16 ***
## campaign -4.331e-04 5.930e-04 -0.730 0.465213
## pdays -6.418e-05 3.770e-05 -1.702 0.088726 .
## previous 5.689e-04 7.702e-04 0.739 0.460100
## ID 7.573e-06 2.548e-07 29.717 < 2e-16 ***
## job_1 5.937e-03 5.158e-03 1.151 0.249690
## job_2 -1.901e-03 5.401e-03 -0.352 0.724917
## job_3 1.306e-02 6.278e-03 2.080 0.037554 *
## job_4 3.777e-02 1.310e-02 2.884 0.003935 **
## job_5 3.542e-02 9.537e-03 3.714 0.000204 ***
## job_6 -5.324e-03 1.082e-02 -0.492 0.622741
## divorced 1.306e-02 5.632e-03 2.318 0.020434 *
## single 2.032e-02 4.460e-03 4.556 5.23e-06 ***
## edu_primary -8.135e-03 5.217e-03 -1.559 0.118898
## edu_tert 6.514e-03 5.064e-03 1.286 0.198365
## co_cellular -1.021e-01 6.775e-03 -15.071 < 2e-16 ***
## co_tel -1.131e-01 9.407e-03 -12.018 < 2e-16 ***
## month_1 2.873e-02 4.896e-03 5.869 4.45e-09 ***
## month_2 9.908e-02 1.438e-02 6.890 5.71e-12 ***
## month_3 3.242e-01 1.765e-02 18.371 < 2e-16 ***
## month_4 1.803e-01 1.449e-02 12.442 < 2e-16 ***
## month_5 4.453e-02 7.883e-03 5.649 1.63e-08 ***
## month_6 3.335e-02 8.732e-03 3.820 0.000134 ***
## poc_success 3.541e-01 1.219e-02 29.052 < 2e-16 ***
## poc_failure -2.385e-02 1.112e-02 -2.145 0.031923 *
## poc_other 3.201e-04 1.299e-02 0.025 0.980335
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2667 on 23702 degrees of freedom
## Multiple R-squared: 0.3258, Adjusted R-squared: 0.3248
## F-statistic: 357.9 on 32 and 23702 DF, p-value: < 2.2e-16
Let's first build the logistic model on all variables in the train_75 dataset.
# Build the model on train_75; use family = "binomial" for logistic regression
fit=glm(y~.,data=train_75, family = "binomial") # y ~ . uses all 33 predictors, edu_sec included
summary(fit) # AIC is 10798.48; the lower the AIC, the better the model
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = train_75)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8435 -0.3500 -0.2119 -0.1152 3.2744
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.337e+00 2.387e-01 -26.550 < 2e-16 ***
## age 1.836e-03 3.056e-03 0.601 0.547894
## default 2.645e-02 2.455e-01 0.108 0.914198
## balance 1.292e-05 6.806e-06 1.898 0.057750 .
## housing -4.661e-01 6.120e-02 -7.616 2.61e-14 ***
## loan -2.623e-01 8.336e-02 -3.147 0.001651 **
## day 5.016e-03 3.219e-03 1.558 0.119175
## duration 4.644e-03 9.512e-05 48.828 < 2e-16 ***
## campaign -5.297e-02 1.391e-02 -3.807 0.000141 ***
## pdays -5.368e-05 4.011e-04 -0.134 0.893533
## previous 4.628e-03 7.213e-03 0.642 0.521155
## ID 1.025e-04 3.560e-06 28.783 < 2e-16 ***
## job_1 7.968e-02 7.810e-02 1.020 0.307607
## job_2 -4.310e-02 8.704e-02 -0.495 0.620497
## job_3 1.573e-01 8.991e-02 1.750 0.080182 .
## job_4 1.051e-01 1.439e-01 0.730 0.465176
## job_5 2.545e-01 1.269e-01 2.007 0.044801 *
## job_6 -1.377e-01 1.508e-01 -0.913 0.360984
## divorced 2.404e-01 8.274e-02 2.906 0.003661 **
## single 2.588e-01 6.507e-02 3.978 6.94e-05 ***
## edu_primary -1.090e-01 1.454e-01 -0.749 0.453666
## edu_sec 9.410e-02 1.291e-01 0.729 0.466049
## edu_tert 1.667e-01 1.357e-01 1.228 0.219420
## co_cellular -8.584e-01 1.030e-01 -8.331 < 2e-16 ***
## co_tel -1.037e+00 1.424e-01 -7.286 3.20e-13 ***
## month_1 5.589e-01 7.625e-02 7.329 2.32e-13 ***
## month_2 5.998e-01 1.384e-01 4.335 1.46e-05 ***
## month_3 2.263e+00 1.645e-01 13.755 < 2e-16 ***
## month_4 1.134e+00 1.392e-01 8.149 3.67e-16 ***
## month_5 7.157e-01 9.684e-02 7.390 1.46e-13 ***
## month_6 6.463e-01 1.141e-01 5.663 1.48e-08 ***
## poc_success 1.514e+00 1.153e-01 13.132 < 2e-16 ***
## poc_failure -3.830e-01 1.238e-01 -3.094 0.001975 **
## poc_other -1.598e-01 1.441e-01 -1.109 0.267591
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17389 on 23734 degrees of freedom
## Residual deviance: 10730 on 23701 degrees of freedom
## AIC: 10798
##
## Number of Fisher Scoring iterations: 6
All VIF values are under control. Now we look at the model summary, specifically the p-values associated with the variables. We can drop variables with high p-values (> 0.05) one by one, or we can use the step function, which drops variables one by one based on the AIC score. Although the methodology is different, the end result is generally similar, because both approaches target variables that contribute little to explaining the response.
# Drop variables one at a time based on AIC using step()
fit=step(fit)
## Start: AIC=10798.48
## y ~ age + default + balance + housing + loan + day + duration +
## campaign + pdays + previous + ID + job_1 + job_2 + job_3 +
## job_4 + job_5 + job_6 + divorced + single + edu_primary +
## edu_sec + edu_tert + co_cellular + co_tel + month_1 + month_2 +
## month_3 + month_4 + month_5 + month_6 + poc_success + poc_failure +
## poc_other
##
## Df Deviance AIC
## - default 1 10730 10796
## - pdays 1 10730 10796
## - job_2 1 10731 10797
## - previous 1 10731 10797
## - age 1 10731 10797
## - job_4 1 10731 10797
## - edu_sec 1 10731 10797
## - edu_primary 1 10731 10797
## - job_6 1 10731 10797
## - job_1 1 10732 10798
## - poc_other 1 10732 10798
## - edu_tert 1 10732 10798
## <none> 10730 10798
## - day 1 10733 10799
## - job_3 1 10734 10800
## - balance 1 10734 10800
## - job_5 1 10734 10800
## - divorced 1 10739 10805
## - poc_failure 1 10740 10806
## - loan 1 10741 10807
## - single 1 10746 10812
## - campaign 1 10746 10812
## - month_2 1 10749 10815
## - month_6 1 10762 10828
## - month_5 1 10784 10850
## - co_tel 1 10785 10851
## - month_1 1 10785 10851
## - housing 1 10789 10855
## - month_4 1 10795 10861
## - co_cellular 1 10797 10863
## - poc_success 1 10906 10972
## - month_3 1 10911 10977
## - ID 1 11574 11640
## - duration 1 14015 14081
##
## Step: AIC=10796.49
## y ~ age + balance + housing + loan + day + duration + campaign +
## pdays + previous + ID + job_1 + job_2 + job_3 + job_4 + job_5 +
## job_6 + divorced + single + edu_primary + edu_sec + edu_tert +
## co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 +
## month_5 + month_6 + poc_success + poc_failure + poc_other
##
## Df Deviance AIC
## - pdays 1 10730 10794
## - job_2 1 10731 10795
## - previous 1 10731 10795
## - age 1 10731 10795
## - job_4 1 10731 10795
## - edu_sec 1 10731 10795
## - edu_primary 1 10731 10795
## - job_6 1 10731 10795
## - job_1 1 10732 10796
## - poc_other 1 10732 10796
## - edu_tert 1 10732 10796
## <none> 10730 10796
## - day 1 10733 10797
## - job_3 1 10734 10798
## - balance 1 10734 10798
## - job_5 1 10734 10798
## - divorced 1 10739 10803
## - poc_failure 1 10740 10804
## - loan 1 10741 10805
## - campaign 1 10746 10810
## - single 1 10746 10810
## - month_2 1 10749 10813
## - month_6 1 10762 10826
## - month_5 1 10784 10848
## - co_tel 1 10785 10849
## - month_1 1 10785 10849
## - housing 1 10789 10853
## - month_4 1 10795 10859
## - co_cellular 1 10797 10861
## - poc_success 1 10906 10970
## - month_3 1 10911 10975
## - ID 1 11575 11639
## - duration 1 14016 14080
##
## Step: AIC=10794.5
## y ~ age + balance + housing + loan + day + duration + campaign +
## previous + ID + job_1 + job_2 + job_3 + job_4 + job_5 + job_6 +
## divorced + single + edu_primary + edu_sec + edu_tert + co_cellular +
## co_tel + month_1 + month_2 + month_3 + month_4 + month_5 +
## month_6 + poc_success + poc_failure + poc_other
##
## Df Deviance AIC
## - job_2 1 10731 10793
## - previous 1 10731 10793
## - age 1 10731 10793
## - job_4 1 10731 10793
## - edu_sec 1 10731 10793
## - edu_primary 1 10731 10793
## - job_6 1 10731 10793
## - job_1 1 10732 10794
## - edu_tert 1 10732 10794
## <none> 10730 10794
## - poc_other 1 10733 10795
## - day 1 10733 10795
## - job_3 1 10734 10796
## - balance 1 10734 10796
## - job_5 1 10734 10796
## - divorced 1 10739 10801
## - loan 1 10741 10803
## - single 1 10746 10808
## - campaign 1 10746 10808
## - month_2 1 10749 10811
## - poc_failure 1 10753 10815
## - month_6 1 10762 10824
## - month_5 1 10784 10846
## - co_tel 1 10785 10847
## - month_1 1 10786 10848
## - housing 1 10790 10852
## - month_4 1 10796 10858
## - co_cellular 1 10797 10859
## - month_3 1 10912 10974
## - poc_success 1 10976 11038
## - ID 1 11576 11638
## - duration 1 14016 14078
##
## Step: AIC=10792.75
## y ~ age + balance + housing + loan + day + duration + campaign +
## previous + ID + job_1 + job_3 + job_4 + job_5 + job_6 + divorced +
## single + edu_primary + edu_sec + edu_tert + co_cellular +
## co_tel + month_1 + month_2 + month_3 + month_4 + month_5 +
## month_6 + poc_success + poc_failure + poc_other
##
## Df Deviance AIC
## - age 1 10731 10791
## - previous 1 10731 10791
## - edu_sec 1 10731 10791
## - edu_primary 1 10731 10791
## - job_4 1 10731 10791
## - job_6 1 10731 10791
## - edu_tert 1 10732 10792
## - job_1 1 10732 10792
## <none> 10731 10793
## - poc_other 1 10733 10793
## - day 1 10733 10793
## - balance 1 10734 10794
## - job_3 1 10735 10795
## - job_5 1 10736 10796
## - divorced 1 10739 10799
## - loan 1 10741 10801
## - single 1 10746 10806
## - campaign 1 10746 10806
## - month_2 1 10749 10809
## - poc_failure 1 10754 10814
## - month_6 1 10762 10822
## - month_5 1 10784 10844
## - co_tel 1 10785 10845
## - month_1 1 10786 10846
## - housing 1 10790 10850
## - month_4 1 10796 10856
## - co_cellular 1 10797 10857
## - month_3 1 10912 10972
## - poc_success 1 10976 11036
## - ID 1 11577 11637
## - duration 1 14016 14076
##
## Step: AIC=10791.07
## y ~ balance + housing + loan + day + duration + campaign + previous +
## ID + job_1 + job_3 + job_4 + job_5 + job_6 + divorced + single +
## edu_primary + edu_sec + edu_tert + co_cellular + co_tel +
## month_1 + month_2 + month_3 + month_4 + month_5 + month_6 +
## poc_success + poc_failure + poc_other
##
## Df Deviance AIC
## - previous 1 10731 10789
## - edu_sec 1 10732 10790
## - job_4 1 10732 10790
## - edu_primary 1 10732 10790
## - job_6 1 10732 10790
## - edu_tert 1 10732 10790
## - job_1 1 10733 10791
## <none> 10731 10791
## - poc_other 1 10733 10791
## - day 1 10734 10792
## - balance 1 10735 10793
## - job_3 1 10736 10794
## - job_5 1 10739 10797
## - divorced 1 10740 10798
## - loan 1 10741 10799
## - campaign 1 10747 10805
## - single 1 10748 10806
## - month_2 1 10750 10808
## - poc_failure 1 10754 10812
## - month_6 1 10762 10820
## - month_5 1 10784 10842
## - co_tel 1 10785 10843
## - month_1 1 10787 10845
## - housing 1 10792 10850
## - month_4 1 10797 10855
## - co_cellular 1 10797 10855
## - month_3 1 10913 10971
## - poc_success 1 10977 11035
## - ID 1 11579 11637
## - duration 1 14017 14075
##
## Step: AIC=10789.43
## y ~ balance + housing + loan + day + duration + campaign + ID +
## job_1 + job_3 + job_4 + job_5 + job_6 + divorced + single +
## edu_primary + edu_sec + edu_tert + co_cellular + co_tel +
## month_1 + month_2 + month_3 + month_4 + month_5 + month_6 +
## poc_success + poc_failure + poc_other
##
## Df Deviance AIC
## - edu_sec 1 10732 10788
## - job_4 1 10732 10788
## - edu_primary 1 10732 10788
## - job_6 1 10732 10788
## - edu_tert 1 10733 10789
## - job_1 1 10733 10789
## <none> 10731 10789
## - poc_other 1 10734 10790
## - day 1 10734 10790
## - balance 1 10735 10791
## - job_3 1 10736 10792
## - job_5 1 10740 10796
## - divorced 1 10740 10796
## - loan 1 10742 10798
## - campaign 1 10747 10803
## - single 1 10748 10804
## - month_2 1 10750 10806
## - poc_failure 1 10755 10811
## - month_6 1 10763 10819
## - month_5 1 10785 10841
## - co_tel 1 10786 10842
## - month_1 1 10787 10843
## - housing 1 10792 10848
## - month_4 1 10797 10853
## - co_cellular 1 10798 10854
## - month_3 1 10913 10969
## - poc_success 1 11002 11058
## - ID 1 11580 11636
## - duration 1 14018 14074
##
## Step: AIC=10787.89
## y ~ balance + housing + loan + day + duration + campaign + ID +
## job_1 + job_3 + job_4 + job_5 + job_6 + divorced + single +
## edu_primary + edu_tert + co_cellular + co_tel + month_1 +
## month_2 + month_3 + month_4 + month_5 + month_6 + poc_success +
## poc_failure + poc_other
##
## Df Deviance AIC
## - job_4 1 10732 10786
## - job_6 1 10733 10787
## - edu_tert 1 10733 10787
## - job_1 1 10734 10788
## - poc_other 1 10734 10788
## <none> 10732 10788
## - day 1 10734 10788
## - balance 1 10736 10790
## - job_3 1 10736 10790
## - edu_primary 1 10737 10791
## - job_5 1 10740 10794
## - divorced 1 10740 10794
## - loan 1 10742 10796
## - campaign 1 10747 10801
## - single 1 10749 10803
## - month_2 1 10750 10804
## - poc_failure 1 10756 10810
## - month_6 1 10763 10817
## - month_5 1 10785 10839
## - co_tel 1 10786 10840
## - month_1 1 10788 10842
## - housing 1 10792 10846
## - month_4 1 10798 10852
## - co_cellular 1 10798 10852
## - month_3 1 10914 10968
## - poc_success 1 11002 11056
## - ID 1 11580 11634
## - duration 1 14018 14072
##
## Step: AIC=10786.33
## y ~ balance + housing + loan + day + duration + campaign + ID +
## job_1 + job_3 + job_5 + job_6 + divorced + single + edu_primary +
## edu_tert + co_cellular + co_tel + month_1 + month_2 + month_3 +
## month_4 + month_5 + month_6 + poc_success + poc_failure +
## poc_other
##
## Df Deviance AIC
## - job_6 1 10733 10785
## - edu_tert 1 10734 10786
## - job_1 1 10734 10786
## - poc_other 1 10734 10786
## <none> 10732 10786
## - day 1 10735 10787
## - balance 1 10736 10788
## - job_3 1 10736 10788
## - edu_primary 1 10737 10789
## - job_5 1 10740 10792
## - divorced 1 10741 10793
## - loan 1 10743 10795
## - campaign 1 10748 10800
## - month_2 1 10751 10803
## - single 1 10751 10803
## - poc_failure 1 10756 10808
## - month_6 1 10764 10816
## - month_5 1 10786 10838
## - co_tel 1 10786 10838
## - month_1 1 10788 10840
## - housing 1 10795 10847
## - month_4 1 10799 10851
## - co_cellular 1 10799 10851
## - month_3 1 10915 10967
## - poc_success 1 11003 11055
## - ID 1 11590 11642
## - duration 1 14018 14070
##
## Step: AIC=10785.15
## y ~ balance + housing + loan + day + duration + campaign + ID +
## job_1 + job_3 + job_5 + divorced + single + edu_primary +
## edu_tert + co_cellular + co_tel + month_1 + month_2 + month_3 +
## month_4 + month_5 + month_6 + poc_success + poc_failure +
## poc_other
##
## Df Deviance AIC
## - edu_tert 1 10734 10784
## - poc_other 1 10735 10785
## - job_1 1 10735 10785
## <none> 10733 10785
## - day 1 10736 10786
## - balance 1 10737 10787
## - job_3 1 10738 10788
## - edu_primary 1 10738 10788
## - job_5 1 10742 10792
## - divorced 1 10742 10792
## - loan 1 10743 10793
## - campaign 1 10749 10799
## - month_2 1 10752 10802
## - single 1 10752 10802
## - poc_failure 1 10757 10807
## - month_6 1 10764 10814
## - month_5 1 10787 10837
## - co_tel 1 10787 10837
## - month_1 1 10789 10839
## - housing 1 10795 10845
## - month_4 1 10799 10849
## - co_cellular 1 10800 10850
## - month_3 1 10916 10966
## - poc_success 1 11003 11053
## - ID 1 11590 11640
## - duration 1 14018 14068
##
## Step: AIC=10784.24
## y ~ balance + housing + loan + day + duration + campaign + ID +
## job_1 + job_3 + job_5 + divorced + single + edu_primary +
## co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 +
## month_5 + month_6 + poc_success + poc_failure + poc_other
##
## Df Deviance AIC
## - poc_other 1 10736 10784
## <none> 10734 10784
## - day 1 10737 10785
## - job_1 1 10737 10785
## - balance 1 10738 10786
## - edu_primary 1 10740 10788
## - divorced 1 10743 10791
## - job_5 1 10743 10791
## - loan 1 10745 10793
## - job_3 1 10746 10794
## - campaign 1 10750 10798
## - month_2 1 10753 10801
## - single 1 10754 10802
## - poc_failure 1 10758 10806
## - month_6 1 10765 10813
## - month_5 1 10788 10836
## - co_tel 1 10789 10837
## - month_1 1 10790 10838
## - housing 1 10796 10844
## - co_cellular 1 10800 10848
## - month_4 1 10801 10849
## - month_3 1 10918 10966
## - poc_success 1 11005 11053
## - ID 1 11594 11642
## - duration 1 14018 14066
##
## Step: AIC=10784.19
## y ~ balance + housing + loan + day + duration + campaign + ID +
## job_1 + job_3 + job_5 + divorced + single + edu_primary +
## co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 +
## month_5 + month_6 + poc_success + poc_failure
##
## Df Deviance AIC
## <none> 10736 10784
## - day 1 10738 10784
## - job_1 1 10739 10785
## - balance 1 10740 10786
## - edu_primary 1 10742 10788
## - divorced 1 10745 10791
## - job_5 1 10745 10791
## - loan 1 10747 10793
## - job_3 1 10748 10794
## - campaign 1 10752 10798
## - month_2 1 10754 10800
## - single 1 10756 10802
## - poc_failure 1 10758 10804
## - month_6 1 10767 10813
## - month_5 1 10790 10836
## - co_tel 1 10790 10836
## - month_1 1 10792 10838
## - housing 1 10801 10847
## - co_cellular 1 10802 10848
## - month_4 1 10803 10849
## - month_3 1 10919 10965
## - poc_success 1 11023 11069
## - ID 1 11614 11660
## - duration 1 14020 14066
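The AIC column in the trace above can be checked by hand. For a 0/1 response, AIC equals the residual deviance plus twice the number of estimated coefficients: the starting model gives 10730 + 2 × 34 = 10798 and the last step gives 10736 + 2 × 24 = 10784. The sketch below illustrates the same relationship on simulated data; the data frame `toy` and the variables x and z are hypothetical and stand in for the bank data.

```r
# Minimal sketch on simulated data; toy, x and z are hypothetical.
set.seed(42)
n <- 1000
x <- rnorm(n)
z <- rnorm(n)                               # pure noise, unrelated to y
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))
toy <- data.frame(y, x, z)

full    <- glm(y ~ x + z, data = toy, family = "binomial")
reduced <- glm(y ~ x,     data = toy, family = "binomial")

# For a 0/1 response: AIC = residual deviance + 2 * number of coefficients
aic_by_hand <- deviance(full) + 2 * length(coef(full))
print(c(by_hand = aic_by_hand, from_glm = AIC(full)))

# Dropping a term lowers the AIC only if the deviance rises by less than 2
print(c(full = AIC(full), reduced = AIC(reduced)))
```

This is why `step()` keeps removing terms whose deletion barely changes the deviance: each removed coefficient saves 2 AIC points, so a term has to reduce the deviance by more than that to stay.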
summary(fit)
##
## Call:
## glm(formula = y ~ balance + housing + loan + day + duration +
## campaign + ID + job_1 + job_3 + job_5 + divorced + single +
## edu_primary + co_cellular + co_tel + month_1 + month_2 +
## month_3 + month_4 + month_5 + month_6 + poc_success + poc_failure,
## family = "binomial", data = train_75)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8432 -0.3511 -0.2118 -0.1153 3.2759
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.173e+00 1.509e-01 -40.905 < 2e-16 ***
## balance 1.340e-05 6.783e-06 1.976 0.048156 *
## housing -4.778e-01 5.975e-02 -7.996 1.29e-15 ***
## loan -2.631e-01 8.294e-02 -3.173 0.001510 **
## day 4.830e-03 3.215e-03 1.502 0.133033
## duration 4.639e-03 9.499e-05 48.831 < 2e-16 ***
## campaign -5.233e-02 1.388e-02 -3.770 0.000163 ***
## ID 1.018e-04 3.471e-06 29.322 < 2e-16 ***
## job_1 1.101e-01 6.854e-02 1.606 0.108177
## job_3 2.267e-01 6.556e-02 3.458 0.000543 ***
## job_5 3.180e-01 1.041e-01 3.053 0.002265 **
## divorced 2.439e-01 8.239e-02 2.960 0.003072 **
## single 2.590e-01 5.723e-02 4.526 6.00e-06 ***
## edu_primary -2.053e-01 8.384e-02 -2.449 0.014316 *
## co_cellular -8.516e-01 1.027e-01 -8.291 < 2e-16 ***
## co_tel -1.026e+00 1.411e-01 -7.269 3.62e-13 ***
## month_1 5.612e-01 7.585e-02 7.399 1.37e-13 ***
## month_2 5.963e-01 1.380e-01 4.320 1.56e-05 ***
## month_3 2.270e+00 1.641e-01 13.834 < 2e-16 ***
## month_4 1.145e+00 1.388e-01 8.254 < 2e-16 ***
## month_5 7.157e-01 9.677e-02 7.397 1.40e-13 ***
## month_6 6.356e-01 1.131e-01 5.621 1.89e-08 ***
## poc_success 1.545e+00 9.288e-02 16.633 < 2e-16 ***
## poc_failure -3.617e-01 7.798e-02 -4.638 3.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17389 on 23734 degrees of freedom
## Residual deviance: 10736 on 23711 degrees of freedom
## AIC: 10784
##
## Number of Fisher Scoring iterations: 6
# Inspect the formula that step() settled on before dropping variables by p-value
formula(fit)
## y ~ balance + housing + loan + day + duration + campaign + ID +
## job_1 + job_3 + job_5 + divorced + single + edu_primary +
## co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 +
## month_5 + month_6 + poc_success + poc_failure
# List the variables that remain after step()
names(fit$coefficients)
## [1] "(Intercept)" "balance" "housing" "loan" "day"
## [6] "duration" "campaign" "ID" "job_1" "job_3"
## [11] "job_5" "divorced" "single" "edu_primary" "co_cellular"
## [16] "co_tel" "month_1" "month_2" "month_3" "month_4"
## [21] "month_5" "month_6" "poc_success" "poc_failure"
# Build the final logistic model on train_75, leaving out day and job_1 (p > 0.05 above)
fit_final=glm(y ~ balance + housing + loan + duration + campaign + ID +
job_3 + job_5 + divorced + single + edu_primary +
co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 +
month_5 + month_6 + poc_success + poc_failure,
data=train_75,family="binomial")
summary(fit_final)
##
## Call:
## glm(formula = y ~ balance + housing + loan + duration + campaign +
## ID + job_3 + job_5 + divorced + single + edu_primary + co_cellular +
## co_tel + month_1 + month_2 + month_3 + month_4 + month_5 +
## month_6 + poc_success + poc_failure, family = "binomial",
## data = train_75)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8400 -0.3506 -0.2118 -0.1152 3.2747
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.063e+00 1.406e-01 -43.133 < 2e-16 ***
## balance 1.360e-05 6.765e-06 2.010 0.04444 *
## housing -4.867e-01 5.962e-02 -8.163 3.28e-16 ***
## loan -2.664e-01 8.289e-02 -3.214 0.00131 **
## duration 4.633e-03 9.493e-05 48.810 < 2e-16 ***
## campaign -5.025e-02 1.383e-02 -3.633 0.00028 ***
## ID 1.010e-04 3.449e-06 29.298 < 2e-16 ***
## job_3 1.908e-01 6.148e-02 3.103 0.00192 **
## job_5 2.922e-01 1.025e-01 2.851 0.00436 **
## divorced 2.467e-01 8.237e-02 2.995 0.00275 **
## single 2.621e-01 5.717e-02 4.585 4.54e-06 ***
## edu_primary -2.274e-01 8.249e-02 -2.756 0.00585 **
## co_cellular -8.305e-01 1.022e-01 -8.130 4.28e-16 ***
## co_tel -1.009e+00 1.408e-01 -7.166 7.72e-13 ***
## month_1 5.700e-01 7.577e-02 7.523 5.37e-14 ***
## month_2 5.917e-01 1.379e-01 4.292 1.77e-05 ***
## month_3 2.266e+00 1.638e-01 13.836 < 2e-16 ***
## month_4 1.167e+00 1.383e-01 8.437 < 2e-16 ***
## month_5 7.328e-01 9.600e-02 7.633 2.30e-14 ***
## month_6 5.994e-01 1.099e-01 5.453 4.95e-08 ***
## poc_success 1.548e+00 9.286e-02 16.676 < 2e-16 ***
## poc_failure -3.620e-01 7.792e-02 -4.646 3.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17389 on 23734 degrees of freedom
## Residual deviance: 10741 on 23713 degrees of freedom
## AIC: 10785
##
## Number of Fisher Scoring iterations: 6
Thus the logistic regression model is successfully built.
Now let's predict scores on test_25.
library(pROC)
score=predict(fit_final,newdata =test_25,type = "response")
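With `type = "response"`, `predict()` returns probabilities rather than log-odds. The sketch below, on simulated data, shows how the two relate: the probability is the inverse logit of the linear predictor. The model `m` and the data frames `toy` and `new_obs` are hypothetical and are not the objects used in this analysis.

```r
# Minimal sketch on simulated data; m, toy and new_obs are hypothetical.
set.seed(7)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.5 + 2 * toy$x))
m <- glm(y ~ x, data = toy, family = "binomial")

new_obs  <- data.frame(x = c(-1, 0, 1))
log_odds <- predict(m, newdata = new_obs, type = "link")      # linear predictor
prob     <- predict(m, newdata = new_obs, type = "response")  # P(y = 1)

# The response scale is the inverse logit of the link scale
by_hand <- 1 / (1 + exp(-log_odds))
print(cbind(log_odds, prob, by_hand))
```

The `prob` and `by_hand` columns should match, which confirms that the score used for AUC and KS is a probability between 0 and 1.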
Check the performance using the AUC score.
# Area under the ROC curve:
roccurve=roc(test_25$y,score) #real outcome and predicted score
auc(roccurve)
## Area under the curve: 0.9212
The area under the curve is 0.9212. The higher the AUC, the better the model separates the two classes.
The modelled probability is P(y=1) by default, meaning the score should be high when the outcome is 1 and low when the outcome is 0.
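One way to read the AUC is as the probability that a randomly chosen customer with outcome 1 receives a higher score than a randomly chosen customer with outcome 0. The sketch below computes it that way on simulated scores; `actual` and `score` are hypothetical vectors generated for illustration, not the predictions from this model.

```r
# Minimal sketch on simulated scores; actual and score are hypothetical.
set.seed(3)
actual <- rbinom(300, 1, 0.2)
score  <- plogis(rnorm(300, mean = ifelse(actual == 1, 1, -1)))

pos <- score[actual == 1]
neg <- score[actual == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# Same quantity from the rank sum (Mann-Whitney U statistic)
r <- rank(c(pos, neg))
auc_rank <- (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

print(c(pairs = auc_pairs, rank_sum = auc_rank))
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, so a value such as 0.9212 means positives outrank negatives in roughly 92% of such pairs.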
Let's visualise how the binary response behaves with respect to the score we obtained.
library(ggplot2)
mydata=data.frame(Actual=test_25$y,Predicted=score)
ggplot(mydata,aes(y=Actual,x=Predicted,color=factor(Actual)))+
  geom_jitter() # jitter alone; adding geom_point() as well would draw every point twice
You can see that response 0 is bunched around low scores and response 1 is bunched around high scores. However, there is overlap across score values as well. We need to find a cutoff on this score if we want to predict hard classes.
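Since the evaluation criterion is the KS score, one possible way to pick the cutoff is to scan a grid of candidate values and keep the one where the gap between the true positive rate and the false positive rate is largest. The sketch below does this on simulated scores. The vectors `actual` and `score` and the grid spacing of 0.01 are assumptions made for illustration, and this is not necessarily how the cutoff is chosen later in this analysis.

```r
# Minimal sketch on simulated scores; actual, score and the grid are hypothetical.
set.seed(11)
actual <- rbinom(500, 1, 0.15)
score  <- plogis(rnorm(500, mean = ifelse(actual == 1, 0.5, -2)))

cutoffs <- seq(0.01, 0.99, by = 0.01)

# KS at a cutoff = true positive rate - false positive rate
ks_at <- sapply(cutoffs, function(cut) {
  pred <- as.numeric(score > cut)
  tpr  <- sum(pred == 1 & actual == 1) / sum(actual == 1)
  fpr  <- sum(pred == 1 & actual == 0) / sum(actual == 0)
  tpr - fpr
})

best <- which.max(ks_at)
print(c(cutoff = cutoffs[best], KS = ks_at[best]))
```

Customers scoring above the chosen cutoff would be predicted as 1 and the rest as 0. A finer grid gives a more precise cutoff at the cost of more evaluations.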
Let's build the model on the entire train data.
# So the expected AUC of the logistic regression is around 0.9212
# Now let's build the model on the entire training data
library(car)
for_vif_final=lm(y~.,data=train)
summary(for_vif_final)
##
## Call:
## lm(formula = y ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.06429 -0.11770 -0.03423 0.03642 1.04121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.505e-01 1.283e-02 -11.729 < 2e-16 ***
## age 3.015e-04 1.837e-04 1.641 0.100800
## default 4.211e-03 1.143e-02 0.368 0.712674
## balance 1.054e-06 5.003e-07 2.106 0.035204 *
## housing -3.438e-02 3.555e-03 -9.672 < 2e-16 ***
## loan -1.291e-02 4.153e-03 -3.110 0.001874 **
## day 4.884e-04 1.926e-04 2.536 0.011233 *
## duration 4.797e-04 5.887e-06 81.488 < 2e-16 ***
## campaign -2.866e-04 5.015e-04 -0.571 0.567727
## pdays -7.456e-05 3.220e-05 -2.316 0.020582 *
## previous 4.162e-04 7.179e-04 0.580 0.562084
## ID 7.404e-06 2.186e-07 33.867 < 2e-16 ***
## job_1 3.799e-03 4.439e-03 0.856 0.392043
## job_2 -4.361e-03 4.689e-03 -0.930 0.352264
## job_3 1.029e-02 5.398e-03 1.907 0.056592 .
## job_4 5.558e-02 1.129e-02 4.924 8.52e-07 ***
## job_5 2.802e-02 8.136e-03 3.444 0.000574 ***
## job_6 -4.295e-03 9.344e-03 -0.460 0.645792
## divorced 1.394e-02 4.858e-03 2.870 0.004103 **
## single 1.722e-02 3.834e-03 4.492 7.08e-06 ***
## edu_primary -6.400e-03 8.419e-03 -0.760 0.447196
## edu_sec 4.772e-03 7.748e-03 0.616 0.538009
## edu_tert 1.072e-02 8.298e-03 1.292 0.196520
## co_cellular -9.688e-02 5.808e-03 -16.682 < 2e-16 ***
## co_tel -1.092e-01 8.066e-03 -13.535 < 2e-16 ***
## month_1 2.368e-02 4.220e-03 5.610 2.04e-08 ***
## month_2 1.007e-01 1.240e-02 8.115 5.03e-16 ***
## month_3 3.088e-01 1.525e-02 20.248 < 2e-16 ***
## month_4 1.630e-01 1.257e-02 12.967 < 2e-16 ***
## month_5 3.982e-02 6.821e-03 5.837 5.35e-09 ***
## month_6 3.573e-02 7.516e-03 4.754 2.00e-06 ***
## poc_success 3.651e-01 1.063e-02 34.353 < 2e-16 ***
## poc_failure -2.516e-02 9.567e-03 -2.630 0.008552 **
## poc_other -7.933e-03 1.110e-02 -0.715 0.474773
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2656 on 31613 degrees of freedom
## Multiple R-squared: 0.3205, Adjusted R-squared: 0.3198
## F-statistic: 451.9 on 33 and 31613 DF, p-value: < 2.2e-16
sort(vif(for_vif_final),decreasing = T)[1:3]
## edu_sec edu_tert pdays
## 6.725263 6.376653 4.658624
# edu_sec has the highest VIF, so drop it from the model
for_vif_final=lm(y~.-edu_sec,data=train)
sort(vif(for_vif_final),decreasing = T)[1:3]
## pdays poc_failure ID
## 4.658565 3.929525 3.659703
summary(for_vif_final)
##
## Call:
## lm(formula = y ~ . - edu_sec, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.06369 -0.11762 -0.03423 0.03654 1.04150
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.458e-01 1.027e-02 -14.198 < 2e-16 ***
## age 2.911e-04 1.829e-04 1.591 0.111552
## default 4.193e-03 1.143e-02 0.367 0.713794
## balance 1.054e-06 5.003e-07 2.107 0.035125 *
## housing -3.431e-02 3.553e-03 -9.656 < 2e-16 ***
## loan -1.279e-02 4.148e-03 -3.083 0.002048 **
## day 4.882e-04 1.926e-04 2.534 0.011271 *
## duration 4.797e-04 5.887e-06 81.488 < 2e-16 ***
## campaign -2.863e-04 5.015e-04 -0.571 0.568116
## pdays -7.464e-05 3.220e-05 -2.318 0.020460 *
## previous 4.169e-04 7.179e-04 0.581 0.561378
## ID 7.402e-06 2.186e-07 33.863 < 2e-16 ***
## job_1 3.789e-03 4.439e-03 0.854 0.393337
## job_2 -4.341e-03 4.688e-03 -0.926 0.354497
## job_3 1.016e-02 5.393e-03 1.884 0.059612 .
## job_4 5.490e-02 1.123e-02 4.887 1.03e-06 ***
## job_5 2.817e-02 8.132e-03 3.464 0.000532 ***
## job_6 -4.206e-03 9.343e-03 -0.450 0.652607
## divorced 1.404e-02 4.855e-03 2.891 0.003844 **
## single 1.719e-02 3.834e-03 4.484 7.35e-06 ***
## edu_primary -1.078e-02 4.507e-03 -2.391 0.016789 *
## edu_tert 6.368e-03 4.357e-03 1.462 0.143829
## co_cellular -9.676e-02 5.804e-03 -16.671 < 2e-16 ***
## co_tel -1.091e-01 8.066e-03 -13.528 < 2e-16 ***
## month_1 2.367e-02 4.220e-03 5.609 2.05e-08 ***
## month_2 1.006e-01 1.240e-02 8.109 5.31e-16 ***
## month_3 3.088e-01 1.525e-02 20.246 < 2e-16 ***
## month_4 1.630e-01 1.257e-02 12.964 < 2e-16 ***
## month_5 3.982e-02 6.821e-03 5.837 5.36e-09 ***
## month_6 3.572e-02 7.516e-03 4.752 2.02e-06 ***
## poc_success 3.651e-01 1.063e-02 34.356 < 2e-16 ***
## poc_failure -2.513e-02 9.567e-03 -2.627 0.008626 **
## poc_other -7.876e-03 1.110e-02 -0.710 0.477961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2656 on 31614 degrees of freedom
## Multiple R-squared: 0.3205, Adjusted R-squared: 0.3198
## F-statistic: 466 on 32 and 31614 DF, p-value: < 2.2e-16
sort(vif(for_vif_final),decreasing = T)[1:3]
## pdays poc_failure ID
## 4.658565 3.929525 3.659703
# Build the model
fit_final_model=glm(y~.,data=train, family = "binomial") # y ~ . uses all 33 predictors, edu_sec included
summary(fit_final_model)
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.7192 -0.3504 -0.2138 -0.1200 3.2289
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.118e+00 2.035e-01 -30.056 < 2e-16 ***
## age 2.818e-04 2.655e-03 0.106 0.915465
## default 8.571e-02 2.144e-01 0.400 0.689298
## balance 1.257e-05 6.194e-06 2.029 0.042491 *
## housing -4.805e-01 5.325e-02 -9.024 < 2e-16 ***
## loan -2.281e-01 7.189e-02 -3.174 0.001506 **
## day 5.077e-03 2.797e-03 1.815 0.069566 .
## duration 4.533e-03 8.163e-05 55.532 < 2e-16 ***
## campaign -5.061e-02 1.201e-02 -4.214 2.51e-05 ***
## pdays -2.162e-04 3.476e-04 -0.622 0.534060
## previous 3.051e-03 7.613e-03 0.401 0.688618
## ID 1.008e-04 3.054e-06 32.999 < 2e-16 ***
## job_1 4.875e-02 6.734e-02 0.724 0.469078
## job_2 -1.003e-01 7.676e-02 -1.306 0.191408
## job_3 1.323e-01 7.764e-02 1.704 0.088465 .
## job_4 2.205e-01 1.224e-01 1.802 0.071521 .
## job_5 1.939e-01 1.100e-01 1.762 0.078006 .
## job_6 -1.227e-01 1.309e-01 -0.937 0.348648
## divorced 2.545e-01 7.199e-02 3.535 0.000408 ***
## single 2.187e-01 5.618e-02 3.892 9.94e-05 ***
## edu_primary -1.735e-01 1.246e-01 -1.393 0.163677
## edu_sec 1.073e-01 1.094e-01 0.981 0.326478
## edu_tert 1.661e-01 1.148e-01 1.447 0.147832
## co_cellular -8.475e-01 8.822e-02 -9.606 < 2e-16 ***
## co_tel -1.047e+00 1.227e-01 -8.534 < 2e-16 ***
## month_1 4.837e-01 6.624e-02 7.302 2.83e-13 ***
## month_2 5.894e-01 1.197e-01 4.923 8.54e-07 ***
## month_3 2.152e+00 1.420e-01 15.154 < 2e-16 ***
## month_4 1.016e+00 1.217e-01 8.345 < 2e-16 ***
## month_5 6.530e-01 8.433e-02 7.743 9.71e-15 ***
## month_6 6.478e-01 9.789e-02 6.617 3.66e-11 ***
## poc_success 1.560e+00 1.019e-01 15.318 < 2e-16 ***
## poc_failure -3.775e-01 1.087e-01 -3.474 0.000512 ***
## poc_other -2.221e-01 1.256e-01 -1.768 0.077107 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22913 on 31646 degrees of freedom
## Residual deviance: 14281 on 31613 degrees of freedom
## AIC: 14349
##
## Number of Fisher Scoring iterations: 6
# step() is skipped for the full training data (the call is left commented out)
#fit=step(fit_final_model)
summary(fit_final_model)
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.7192 -0.3504 -0.2138 -0.1200 3.2289
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.118e+00 2.035e-01 -30.056 < 2e-16 ***
## age 2.818e-04 2.655e-03 0.106 0.915465
## default 8.571e-02 2.144e-01 0.400 0.689298
## balance 1.257e-05 6.194e-06 2.029 0.042491 *
## housing -4.805e-01 5.325e-02 -9.024 < 2e-16 ***
## loan -2.281e-01 7.189e-02 -3.174 0.001506 **
## day 5.077e-03 2.797e-03 1.815 0.069566 .
## duration 4.533e-03 8.163e-05 55.532 < 2e-16 ***
## campaign -5.061e-02 1.201e-02 -4.214 2.51e-05 ***
## pdays -2.162e-04 3.476e-04 -0.622 0.534060
## previous 3.051e-03 7.613e-03 0.401 0.688618
## ID 1.008e-04 3.054e-06 32.999 < 2e-16 ***
## job_1 4.875e-02 6.734e-02 0.724 0.469078
## job_2 -1.003e-01 7.676e-02 -1.306 0.191408
## job_3 1.323e-01 7.764e-02 1.704 0.088465 .
## job_4 2.205e-01 1.224e-01 1.802 0.071521 .
## job_5 1.939e-01 1.100e-01 1.762 0.078006 .
## job_6 -1.227e-01 1.309e-01 -0.937 0.348648
## divorced 2.545e-01 7.199e-02 3.535 0.000408 ***
## single 2.187e-01 5.618e-02 3.892 9.94e-05 ***
## edu_primary -1.735e-01 1.246e-01 -1.393 0.163677
## edu_sec 1.073e-01 1.094e-01 0.981 0.326478
## edu_tert 1.661e-01 1.148e-01 1.447 0.147832
## co_cellular -8.475e-01 8.822e-02 -9.606 < 2e-16 ***
## co_tel -1.047e+00 1.227e-01 -8.534 < 2e-16 ***
## month_1 4.837e-01 6.624e-02 7.302 2.83e-13 ***
## month_2 5.894e-01 1.197e-01 4.923 8.54e-07 ***
## month_3 2.152e+00 1.420e-01 15.154 < 2e-16 ***
## month_4 1.016e+00 1.217e-01 8.345 < 2e-16 ***
## month_5 6.530e-01 8.433e-02 7.743 9.71e-15 ***
## month_6 6.478e-01 9.789e-02 6.617 3.66e-11 ***
## poc_success 1.560e+00 1.019e-01 15.318 < 2e-16 ***
## poc_failure -3.775e-01 1.087e-01 -3.474 0.000512 ***
## poc_other -2.221e-01 1.256e-01 -1.768 0.077107 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22913 on 31646 degrees of freedom
## Residual deviance: 14281 on 31613 degrees of freedom
## AIC: 14349
##
## Number of Fisher Scoring iterations: 6
#Let's start dropping variables based on p-values
#formula(fit_final_model)
#Based on this summary, drop (i.e. do not include) the variables with p-value > 0.05.
fit_final_model=glm(y ~ balance + housing + loan + duration +
campaign + pdays + ID + job_3 +
job_5 + divorced + single + edu_primary +
co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 +
month_5 + month_6 + poc_success + poc_failure ,
data=train,family="binomial")
summary(fit_final_model)
##
## Call:
## glm(formula = y ~ balance + housing + loan + duration + campaign +
## pdays + ID + job_3 + job_5 + divorced + single + edu_primary +
## co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 +
## month_5 + month_6 + poc_success + poc_failure, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.7197 -0.3519 -0.2140 -0.1196 3.2387
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.920e+00 1.203e-01 -49.226 < 2e-16 ***
## balance 1.273e-05 6.159e-06 2.067 0.038769 *
## housing -4.930e-01 5.225e-02 -9.436 < 2e-16 ***
## loan -2.309e-01 7.146e-02 -3.232 0.001231 **
## duration 4.521e-03 8.147e-05 55.498 < 2e-16 ***
## campaign -4.789e-02 1.195e-02 -4.008 6.11e-05 ***
## pdays -6.003e-04 2.726e-04 -2.202 0.027661 *
## ID 1.001e-04 3.000e-06 33.362 < 2e-16 ***
## job_3 1.649e-01 5.340e-02 3.087 0.002019 **
## job_5 2.055e-01 8.914e-02 2.306 0.021115 *
## divorced 2.579e-01 7.169e-02 3.597 0.000322 ***
## single 2.503e-01 4.948e-02 5.059 4.20e-07 ***
## edu_primary -3.009e-01 7.264e-02 -4.143 3.43e-05 ***
## co_cellular -8.296e-01 8.774e-02 -9.455 < 2e-16 ***
## co_tel -1.034e+00 1.216e-01 -8.506 < 2e-16 ***
## month_1 4.856e-01 6.593e-02 7.365 1.77e-13 ***
## month_2 5.814e-01 1.195e-01 4.866 1.14e-06 ***
## month_3 2.151e+00 1.416e-01 15.189 < 2e-16 ***
## month_4 1.032e+00 1.209e-01 8.531 < 2e-16 ***
## month_5 6.687e-01 8.358e-02 8.000 1.24e-15 ***
## month_6 5.928e-01 9.448e-02 6.274 3.51e-10 ***
## poc_success 1.653e+00 8.839e-02 18.696 < 2e-16 ***
## poc_failure -2.662e-01 8.724e-02 -3.051 0.002280 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22913 on 31646 degrees of freedom
## Residual deviance: 14298 on 31624 degrees of freedom
## AIC: 14344
##
## Number of Fisher Scoring iterations: 6
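The p-value screening above can also be read off the coefficient table programmatically. The following is a minimal sketch on simulated data (the data frame `d` and the variables `x1`, `x2`, `noise` are made up for illustration, not the bank data). Note that it drops every term above the threshold in a single pass, which is a simplification of the one-by-one removal described above and can give a different final set of variables.

```r
# Simulated data, for illustration only (not the bank data).
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1 + 1.5 * d$x1 - 1.0 * d$x2))

fit <- glm(y ~ ., data = d, family = "binomial")

# Coefficient table with columns Estimate, Std. Error, z value, Pr(>|z|)
coefs <- coef(summary(fit))
pvals <- coefs[-1, "Pr(>|z|)"]        # drop the intercept row

# Keep only the terms that pass the 5% threshold and refit
keep <- names(pvals)[pvals < 0.05]
reduced_formula <- reformulate(keep, response = "y")
fit_reduced <- glm(reduced_formula, data = d, family = "binomial")
summary(fit_reduced)
```

Refitting after each removal, as done above, is the safer routine because p-values shift once a correlated variable leaves the model (compare pdays before and after in the two summaries).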
If we only needed to submit probability scores for the test data, we would be done at this point. We pass type = "response" to predict() to get probability scores.
test.prob.score= predict(fit_final_model,newdata = test,type='response')
write.csv(test.prob.score,"Neha_Raut_Probabilities.csv",row.names = F)
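To see what type = "response" does, here is a small sketch on simulated data (the objects `d` and `fit` are illustrative, not the bank model). For a binomial glm, the default prediction is on the log-odds (link) scale, and the response scale is the logistic transform of it.

```r
# Simulated data, for illustration only.
set.seed(42)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.5 * d$x))
fit <- glm(y ~ x, data = d, family = "binomial")

link_score <- predict(fit, newdata = d, type = "link")      # log-odds
prob_score <- predict(fit, newdata = d, type = "response")  # probabilities in (0, 1)

# The two agree once the log-odds are passed through the logistic function
all.equal(unname(prob_score), unname(plogis(link_score)))
```

Forgetting type = "response" would write log-odds to the submission file, which can be negative or greater than 1.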
Let's find a cutoff based on these probability scores.
The cutoff can be found either with the prediction and performance functions (ROC curve) or with the KS statistic.
1. Using prediction and performance
train.score=predict(fit_final_model,newdata = train,type="response")
real=train$y
#Deciding cutoff
library(ROCR)
#We will need two functions: prediction and performance
ROCRPred=prediction(train.score,train$y)
ROCRPref=performance(ROCRPred,"tpr","fpr") #true positive rate vs false positive rate
plot(ROCRPref,colorize=TRUE,print.cutoffs.at=seq(0.1,1,by=0.1)) #cutoff comes to about 0.1
#OR
#plot(ROCRPref,colorize=TRUE,print.cutoffs.at=seq(0.1,1,by=100))
#OR, using the pROC package
library(pROC)
res.roc <- roc(train$y,train.score)
coords(res.roc, "best")
## threshold specificity sensitivity
## 0.1103465 0.8310595 0.8787634
plot.roc(res.roc, print.auc = TRUE,print.thres = "best") #cutoff comes 0.110
Create a confusion matrix to check how good our model is (by predicting on the test_25 dataset).
#Try for cutoff 0.110
table(ActualValue=test_25$y,predictedValue=score>0.110)
## predictedValue
## ActualValue FALSE TRUE
## 0 5877 1156
## 1 108 771
Accuracy=(771+5877)/(771+5877+108+ 1156)
Accuracy#0.8402427
## [1] 0.8402427
table(ActualValue=test_25$y,predictedValue=score>0.3)
## predictedValue
## ActualValue FALSE TRUE
## 0 6616 417
## 1 345 534
#Accuracy=(534+6616)/(534+6616+345+417)
#Accuracy#0.9036906
table(ActualValue=test_25$y,predictedValue=score>0.5)
## predictedValue
## ActualValue FALSE TRUE
## 0 6843 190
## 1 525 354
Accuracy=(6843+354)/(6843+354+525+190)
#Accuracy#0.9096309
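From the tables above, accuracy is higher at 0.3 and 0.5, but only because more non-subscribers are classified correctly (TN rises from 5877 to 6616 and 6843). The number of missed subscribers (FN) grows from 108 at 0.11 to 345 and 525, so the 0.11 cutoff is the one that separates the two classes best.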
TP=771
FP=1156
P=879 #actual positives = TP+FN
TN=5877
FN=108
N=7033 #actual negatives = TN+FP
Accuracy=(TP+TN)/(P+N)
Accuracy
## [1] 0.8402427
Sn=TP/P #sensitivity
Sp=TN/N #specificity
KS=(TP/P)-(FP/N)
Precision=TP/(TP+FP)
Recall=TP/P
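The same measures can be put side by side for the three cutoffs tried above. This is a small sketch that only re-uses the counts printed in the three test_25 confusion matrices; the data frame name `cm` and the column names are chosen here for illustration.

```r
# Confusion-matrix counts copied from the three test_25 tables above.
cm <- data.frame(
  cutoff = c(0.110, 0.3, 0.5),
  TP = c(771, 534, 354),
  FP = c(1156, 417, 190),
  TN = c(5877, 6616, 6843),
  FN = c(108, 345, 525)
)

P <- cm$TP + cm$FN   # actual positives
N <- cm$TN + cm$FP   # actual negatives

cm$Accuracy <- (cm$TP + cm$TN) / (P + N)
cm$Sn <- cm$TP / P              # sensitivity (true positive rate)
cm$Sp <- cm$TN / N              # specificity (true negative rate)
cm$KS <- cm$Sn - (1 - cm$Sp)    # TPR minus FPR

print(round(cm[, c("cutoff", "Accuracy", "Sn", "Sp", "KS")], 4))

# Cutoff with the largest KS among the three
best_cutoff <- cm$cutoff[which.max(cm$KS)]
best_cutoff
```

Accuracy rises with the cutoff while KS falls, which is why a criterion based on KS favours the lower cutoff. Since KS equals Sn + Sp - 1, maximising it is the same as maximising sensitivity plus specificity.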
2. Using the KS method
We’ll start by calculating probability scores on the training data and creating a base data frame with a single dummy row, to which the for loop will append the values for each cutoff.
train.score=predict(fit_final_model,newdata = train,type="response")
real=train$y
cutoff_data=data.frame(cutoff=0,TP=0,FP=0,FN=0,TN=0)
cutoffs=seq(0,1,length=100)
We’ll go through all the cutoffs and for each we’ll store the calculated values
for (i in cutoffs){
predicted=as.numeric(train.score>i)
TP=sum(predicted==1 & train$y==1)
FP=sum(predicted==1 & train$y==0)
FN=sum(predicted==0 & train$y==1)
TN=sum(predicted==0 & train$y==0)
cutoff_data=rbind(cutoff_data,c(i,TP,FP,FN,TN))
}
#Remove the dummy top row from the data frame cutoff_data
cutoff_data=cutoff_data[-1,]
#we now have 100 obs in df cutoff_data
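As an aside, the same table can be built without a dummy first row and without growing the data frame inside a loop. The sketch below uses simulated scores (`score` and `y` are made up for illustration) and a hypothetical helper `count_at`; it is an alternative formulation, not the code used for the results in this report.

```r
# Simulated scores and labels, for illustration only.
set.seed(7)
score <- runif(1000)
y <- rbinom(1000, 1, score)
cutoffs <- seq(0, 1, length.out = 100)

# Hypothetical helper: confusion-matrix counts at a single cutoff
count_at <- function(cut) {
  predicted <- as.numeric(score > cut)
  c(cutoff = cut,
    TP = sum(predicted == 1 & y == 1),
    FP = sum(predicted == 1 & y == 0),
    FN = sum(predicted == 0 & y == 1),
    TN = sum(predicted == 0 & y == 0))
}

# sapply returns one column per cutoff; transpose to get one row per cutoff
cutoff_tbl <- as.data.frame(t(sapply(cutoffs, count_at)))
head(cutoff_tbl)
```

Each row's four counts add up to the number of observations, which is a quick sanity check on the table.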
Let's calculate the performance measures: sensitivity, specificity, accuracy, KS and precision.
cutoff_data=cutoff_data %>%
mutate(P=FN+TP,N=TN+FP, #total positives and negatives
Sn=TP/P, #sensitivity
Sp=TN/N, #specificity
KS=abs((TP/P)-(FP/N)),
Accuracy=(TP+TN)/(P+N),
Precision=TP/(TP+FP),
Recall=TP/P
) %>%
select(-P,-N)
Let's view the cutoff dataset:
#View(cutoff_data)
Visualise how these measures (individual values) move across the cutoffs:
ggplot(cutoff_data,aes(x=cutoff,y=Sp))+geom_line()
ggplot(cutoff_data,aes(x=cutoff,y=Sn))+geom_line()
ggplot(cutoff_data,aes(x=cutoff,y=KS))+geom_line()
#To look at all of these measures in a single plot
library(tidyr)
library(ggplot2)
cutoff_long=cutoff_data %>%
gather(Measure,Value,KS,Sn,Sp)
ggplot(cutoff_long,aes(x=cutoff,y=Value,color=Measure))+geom_line()
Let's find the cutoff value at which KS is maximum.
#Determine CutOff based on KS
KS_cutoff=cutoff_data$cutoff[which.max(cutoff_data$KS)]
KS_cutoff
## [1] 0.1111111
Hence 0.1111111 is the cutoff value given by the KS-max method.
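The KS value on a grid of cutoffs is an approximation of the two-sample Kolmogorov-Smirnov statistic, that is, the largest gap between the score distributions of subscribers and non-subscribers. The sketch below checks this on simulated scores (`y` and `score` are made up for illustration) by evaluating the gap at every observed score and comparing it with `ks.test` from base R's stats package.

```r
# Simulated labels and scores, for illustration only.
set.seed(123)
y <- rbinom(2000, 1, 0.12)
score <- plogis(-2 + 2 * y + rnorm(2000))

# Evaluate TPR and FPR at every observed score instead of a 100-point grid
cutoffs <- sort(unique(score))
tpr <- sapply(cutoffs, function(cut) mean(score[y == 1] > cut))
fpr <- sapply(cutoffs, function(cut) mean(score[y == 0] > cut))
ks_grid <- max(abs(tpr - fpr))

# Two-sample KS statistic between the two score distributions
ks_stat <- unname(ks.test(score[y == 1], score[y == 0])$statistic)

c(ks_grid = ks_grid, ks_stat = ks_stat)
cutoffs[which.max(abs(tpr - fpr))]   # cutoff at which the gap is largest
```

A 100-point grid, as used above, can only land on the nearest grid value, so the reported cutoff of 0.1111111 (which is 11/99) should be read as accurate to about 0.01.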
test.score=predict(fit_final_model,newdata =test,type = "response")#on final test dataset.
Predicting whether the client has subscribed or not in the final test dataset.
FinalScore=as.numeric(test.score>KS_cutoff)#if score is > cutoff then true(1) else false(0)
table(FinalScore)
## FinalScore
## 0 1
## 10173 3391
#Thus final prediction is as follows:
testFinal=factor(FinalScore,levels = c(0,1),labels=c("no","yes"))
table(testFinal)
## testFinal
## no yes
## 10173 3391
#write.csv(test$leftfinal,"P5_sub_1.csv")
write.csv(testFinal,"Neha_Raut_P5_part2.csv",row.names = F)
Create a confusion matrix to check how good our model is at the KS cutoff (by predicting on the test_25 dataset).
score=predict(fit_final,newdata =test_25,type = "response")
table(test_25$y,as.numeric(score>KS_cutoff))
##
## 0 1
## 0 5885 1148
## 1 109 770
TP=770
FP=1148
P=879 #actual positives = TP+FN
TN=5885
FN=109
N=7033 #actual negatives = TN+FP
Accuracy=(TP+TN)/(P+N)
Accuracy
## [1] 0.8411274
#Error rate at the KS cutoff
1-Accuracy #15.89%
## [1] 0.1588726