Marketing campaign to sell term deposits: Predicting Target Customers

Executed by Neha Raut

Problem Statement:

A Portuguese bank is rolling out a term deposit product for its customers. In the past, the bank has reached its customer base through phone calls. The results of these previous campaigns were recorded and have been given to the current campaign manager to help make this campaign more effective.

The manager faces the following challenges:

Customers have recently started to complain that the bank’s marketing staff bothers them with irrelevant product calls, and they want this to stop immediately.

There is no prior framework to help her decide which customers to call and which to leave alone.

She has decided to use past data to automate this decision instead of manually going through each and every customer. The previous campaign data made available to her contains customer characteristics, campaign characteristics, previous campaign information, and whether the customer ended up subscribing to the product as a result of that campaign. Using this data, she plans to develop a statistical model that predicts whether a given customer will subscribe to the product. A successful model will make her campaign efficiently targeted and less of a bother to uninterested customers.

Aim

To build a machine learning model that predicts which customers the bank should target when rolling out term deposits.
Evaluation criterion: KS score on the test data. The larger the KS score, the better the model.
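
The KS score used as the evaluation criterion is the largest gap between the cumulative true-positive rate and the cumulative false-positive rate as the score cutoff is lowered. A minimal base-R sketch, assuming the scores are predicted probabilities and the labels are coded 0/1; the function name ks_score and the toy vectors are illustrative and not part of the assignment, and tied scores are not grouped in this sketch:

```r
# Hypothetical helper: KS = max over cutoffs of (TPR - FPR).
ks_score <- function(scores, labels) {
  ord <- order(scores, decreasing = TRUE)        # rank customers from highest score down
  labels <- labels[ord]
  tpr <- cumsum(labels == 1) / sum(labels == 1)  # share of responders captured so far
  fpr <- cumsum(labels == 0) / sum(labels == 0)  # share of non-responders captured so far
  max(tpr - fpr)
}

# Toy data (made up for illustration)
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
labels <- c(1, 1, 0, 1, 0, 0)
ks <- ks_score(scores, labels)
print(ks)
```

A model that ranks every responder above every non-responder reaches a KS of 1, while a model that ranks at random stays close to 0.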

Data Information:

We have given you two datasets, bank-full_train.csv and bank-full_test.csv. Use bank-full_train to build a predictive model for the response variable “y”. bank-full_test contains all the other variables except “y”; predict “y” for it using the model you developed and submit your predicted values in a csv file.

Data dictionary:

Variables : Definition: Type and their categories

Each row represents the characteristics of a single customer. Many categorical values have been coded to mask the data; you don’t need to worry about their exact meaning.

1 - age (numeric)

2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”, “self-employed”,“retired”,“technician”, “services”)

3 - marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)

4 - education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)

5 - default: has credit in default? (binary: “yes”,“no”)

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: “yes”,“no”)

8 - loan: has personal loan? (binary: “yes”,“no”)

Related with the last contact of the current campaign:

9 - contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, . . . , “nov”, “dec”)

12 - duration: last contact duration, in seconds (numeric)

Other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

Output variable (desired target):

17 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)

Step 1: Reading Files

Load the required libraries: dplyr, ggplot2 and ROCR.

library(dplyr)
library(ggplot2)
library(ROCR)

Read train and test datasets:

train=read.csv("bank-full_train.csv",stringsAsFactors = FALSE, header=T)
test=read.csv("bank-full_test.csv",stringsAsFactors = FALSE, header=T)

Step 2: Data Preparation

Combine the train and test datasets before data preparation.

Before combining, however, we need a placeholder column that lets us tell apart observations coming from the train and test data. We also need to add a response column to the test data so that both datasets have the same columns. We fill the test data’s response column with NAs.

#Combine both train and test data
test$y=NA
train$data='train'
test$data='test'
all_data=rbind(train,test)
glimpse(all_data)

## Observations: 45,211
## Variables: 19
## $ age       <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37, ...
## $ job       <chr> "blue-collar", "admin.", "technician", "self-employe...
## $ marital   <chr> "married", "divorced", "divorced", "married", "marri...
## $ education <chr> "secondary", "secondary", "secondary", "tertiary", "...
## $ default   <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "no...
## $ balance   <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 387,...
## $ housing   <chr> "no", "no", "no", "no", "yes", "yes", "yes", "no", "...
## $ loan      <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "no...
## $ contact   <chr> "cellular", "cellular", "cellular", "cellular", "unk...
## $ day       <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 21,...
## $ month     <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "ju...
## $ duration  <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 158...
## $ campaign  <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1, ...
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, -1...
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0...
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown...
## $ ID        <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 1231...
## $ y         <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", "n...
## $ data      <chr> "train", "train", "train", "train", "train", "train"...

apply(all_data,2,function(x)sum(is.na(x)))

##       age       job   marital education   default   balance   housing 
##         0         0         0         0         0         0         0 
##      loan   contact       day     month  duration  campaign     pdays 
##         0         0         0         0         0         0         0 
##  previous  poutcome        ID         y      data 
##         0         0         0     13564         0

Find the number of distinct categories in each character variable (excluding y, as it is the target variable).

for(i in 1:ncol(all_data)){
  if(class(all_data[,i])=="character"){
    if(names(all_data)[i]!="y"){
      message=paste("Number of categories in ",names(all_data)[i]," : ")
      num.cat=length(unique(all_data[,i]))
      print(paste0(message,num.cat))
    }
  }
}

## [1] "Number of categories in  job  : 12"
## [1] "Number of categories in  marital  : 3"
## [1] "Number of categories in  education  : 4"
## [1] "Number of categories in  default  : 2"
## [1] "Number of categories in  housing  : 2"
## [1] "Number of categories in  loan  : 2"
## [1] "Number of categories in  contact  : 3"
## [1] "Number of categories in  month  : 12"
## [1] "Number of categories in  poutcome  : 4"
## [1] "Number of categories in  data  : 2"

Create dummy variables for job (character type) by combining categories with similar response rates.

t=table(all_data$job)
sort(t)

## 
##       unknown       student     housemaid    unemployed  entrepreneur 
##           288           938          1240          1303          1487 
## self-employed       retired      services        admin.    technician 
##          1579          2264          4154          5171          7597 
##    management   blue-collar 
##          9458          9732

final=round(prop.table(table(all_data$job,all_data$y),1)*100,1)
final

##                
##                   no  yes
##   admin.        87.9 12.1
##   blue-collar   92.8  7.2
##   entrepreneur  90.7  9.3
##   housemaid     92.4  7.6
##   management    86.3 13.7
##   retired       76.8 23.2
##   self-employed 87.9 12.1
##   services      91.0  9.0
##   student       70.8 29.2
##   technician    88.6 11.4
##   unemployed    85.1 14.9
##   unknown       88.2 11.8

#Add margins
s=addmargins(final,2) #margin 2 appends a Sum column (row totals across y)
#Sort the categories by their share of "no" responses
sort(s[,1])

##       student       retired    unemployed    management        admin. 
##          70.8          76.8          85.1          86.3          87.9 
## self-employed       unknown    technician  entrepreneur      services 
##          87.9          88.2          88.6          90.7          91.0 
##     housemaid   blue-collar 
##          92.4          92.8
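
The grouping of categories is driven by how often each category responds "yes". An equivalent way to obtain that ranking is sketched below in base R on made-up data; the vectors job and y here are illustrative and are not taken from the bank data:

```r
# Toy data (made up for illustration)
job <- c("student", "student", "retired", "retired",
         "services", "services", "services", "services")
y   <- c("yes", "no", "yes", "no", "no", "no", "no", "yes")

# Share of "yes" responses per category, in percent
yes_rate <- tapply(y == "yes", job, mean) * 100

# Categories with similar rates are candidates for sharing one dummy variable
print(sort(yes_rate))
```

Here student and retired have the same rate and could share one dummy, while services would get its own or act as the baseline.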

#Create n-1 dummies; blue-collar (the largest category) is left out as the baseline
all_data=all_data %>% 
  mutate(job_1=as.numeric(job %in% c("self-employed","unknown","technician")), 
         job_2=as.numeric(job %in% c("services","housemaid","entrepreneur")),
         #note: the data value is "admin." (with a trailing period), so "admin" below matches no rows
         job_3=as.numeric(job %in% c("management","admin")),
         job_4=as.numeric(job=="student"),
         job_5=as.numeric(job=="retired"),
         job_6=as.numeric(job=="unemployed")) %>% 
  select(-job)

glimpse(all_data)

## Observations: 45,211
## Variables: 24
## $ age       <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37, ...
## $ marital   <chr> "married", "divorced", "divorced", "married", "marri...
## $ education <chr> "secondary", "secondary", "secondary", "tertiary", "...
## $ default   <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "no...
## $ balance   <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 387,...
## $ housing   <chr> "no", "no", "no", "no", "yes", "yes", "yes", "no", "...
## $ loan      <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "no...
## $ contact   <chr> "cellular", "cellular", "cellular", "cellular", "unk...
## $ day       <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 21,...
## $ month     <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "ju...
## $ duration  <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 158...
## $ campaign  <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1, ...
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, -1...
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0...
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown...
## $ ID        <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 1231...
## $ y         <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", "n...
## $ data      <chr> "train", "train", "train", "train", "train", "train"...
## $ job_1     <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1...
## $ job_2     <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0...
## $ job_3     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0...
## $ job_4     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ job_5     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ job_6     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
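
The pattern above (one 0/1 column per group, with the largest category left out as the baseline) can be sketched generically in base R. The function name make_dummies and the toy vector are assumptions for illustration; unlike the code above, this sketch gives every remaining category its own column and does not merge categories with similar response rates:

```r
# Hypothetical helper: build n-1 indicator columns for a character vector,
# leaving out the most frequent category as the baseline.
make_dummies <- function(x, prefix) {
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[-1]                    # drop the most frequent level
  out <- sapply(keep, function(level) as.numeric(x == level))
  colnames(out) <- paste(prefix, keep, sep = "_")
  as.data.frame(out)
}

# Toy data (made up for illustration)
job <- c("blue-collar", "admin.", "blue-collar", "student", "blue-collar", "admin.")
dummies <- make_dummies(job, "job")
print(dummies)
```

Rows belonging to the baseline category are identified by having 0 in every dummy column, which is why only n-1 columns are needed for n categories.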

#Marital
t=table(all_data$marital)
sort(t)

## 
## divorced   single  married 
##     5207    12790    27214

all_data=all_data %>% 
  mutate(divorced=as.numeric(marital %in% c("divorced")),
         single=as.numeric(marital %in% c("single"))
  ) %>% 
  select(-marital)
glimpse(all_data)

## Observations: 45,211
## Variables: 25
## $ age       <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37, ...
## $ education <chr> "secondary", "secondary", "secondary", "tertiary", "...
## $ default   <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "no...
## $ balance   <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 387,...
## $ housing   <chr> "no", "no", "no", "no", "yes", "yes", "yes", "no", "...
## $ loan      <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "no...
## $ contact   <chr> "cellular", "cellular", "cellular", "cellular", "unk...
## $ day       <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 21,...
## $ month     <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "ju...
## $ duration  <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 158...
## $ campaign  <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1, ...
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, -1...
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0, 0...
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown...
## $ ID        <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 1231...
## $ y         <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", "n...
## $ data      <chr> "train", "train", "train", "train", "train", "train"...
## $ job_1     <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1...
## $ job_2     <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0...
## $ job_3     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0...
## $ job_4     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ job_5     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ job_6     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ divorced  <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ single    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0...

#Education
t=table(all_data$education)
sort(t)

## 
##   unknown   primary  tertiary secondary 
##      1857      6851     13301     23202

all_data=all_data %>% 
  mutate(edu_primary=as.numeric(education %in% c("primary")),
         edu_sec=as.numeric(education %in% c("secondary")),
         edu_tert=as.numeric(education %in% c("tertiary"))
  ) %>% 
  select(-education)
glimpse(all_data)

## Observations: 45,211
## Variables: 27
## $ age         <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default     <chr> "no", "no", "no", "no", "no", "no", "yes", "no", "...
## $ balance     <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing     <chr> "no", "no", "no", "no", "yes", "yes", "yes", "no",...
## $ loan        <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "...
## $ contact     <chr> "cellular", "cellular", "cellular", "cellular", "u...
## $ day         <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ month       <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "...
## $ duration    <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign    <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays       <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous    <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ poutcome    <chr> "unknown", "unknown", "unknown", "unknown", "unkno...
## $ ID          <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y           <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data        <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1       <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2       <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced    <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec     <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert    <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...

#Default
table(all_data$default)

## 
##    no   yes 
## 44396   815

all_data$default=as.numeric(all_data$default=="yes")

#Housing
table(all_data$housing)

## 
##    no   yes 
## 20081 25130

all_data$housing=as.numeric(all_data$housing=="yes")
glimpse(all_data)

## Observations: 45,211
## Variables: 27
## $ age         <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default     <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ balance     <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing     <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,...
## $ loan        <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "...
## $ contact     <chr> "cellular", "cellular", "cellular", "cellular", "u...
## $ day         <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ month       <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "...
## $ duration    <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign    <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays       <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous    <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ poutcome    <chr> "unknown", "unknown", "unknown", "unknown", "unkno...
## $ ID          <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y           <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data        <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1       <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2       <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced    <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec     <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert    <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...

#Loan
table(all_data$loan)

## 
##    no   yes 
## 37967  7244

all_data$loan=as.numeric(all_data$loan=="yes")
glimpse(all_data)

## Observations: 45,211
## Variables: 27
## $ age         <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default     <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ balance     <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing     <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,...
## $ loan        <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ contact     <chr> "cellular", "cellular", "cellular", "cellular", "u...
## $ day         <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ month       <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "...
## $ duration    <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign    <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays       <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous    <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ poutcome    <chr> "unknown", "unknown", "unknown", "unknown", "unkno...
## $ ID          <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y           <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data        <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1       <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2       <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced    <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec     <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert    <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...

#Contact
t=table(all_data$contact)
sort(t)

## 
## telephone   unknown  cellular 
##      2906     13020     29285

all_data=all_data %>% 
  mutate(co_cellular=as.numeric(contact %in% c("cellular")),
         co_tel=as.numeric(contact %in% c("telephone"))
  ) %>% 
  select(-contact)
glimpse(all_data)

## Observations: 45,211
## Variables: 28
## $ age         <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default     <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ balance     <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing     <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,...
## $ loan        <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ day         <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ month       <chr> "aug", "jul", "aug", "mar", "may", "jun", "jun", "...
## $ duration    <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign    <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays       <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous    <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ poutcome    <chr> "unknown", "unknown", "unknown", "unknown", "unkno...
## $ ID          <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y           <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data        <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1       <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2       <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced    <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec     <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert    <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ co_cellular <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,...
## $ co_tel      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...

#Month
table(all_data$month)

## 
##   apr   aug   dec   feb   jan   jul   jun   mar   may   nov   oct   sep 
##  2932  6247   214  2649  1403  6895  5341   477 13766  3970   738   579

finalmnth=round(prop.table(table(all_data$month,all_data$y),1)*100,1)
sss=addmargins(finalmnth,2) #adding margin across Y
sort(sss[,1])

##  mar  oct  sep  dec  apr  feb  aug  jan  nov  jun  jul  may 
## 46.7 55.0 57.8 58.0 80.1 83.1 88.7 89.1 89.3 90.1 91.0 93.2

#Ignore may (the largest category), which serves as the baseline
all_data=all_data %>% 
  mutate(month_1=as.numeric(month %in% c("aug","jan","jun","nov","jul")), 
         month_2=as.numeric(month %in% c("dec","sep")),
         month_3=as.numeric(month=="mar"),
         month_4=as.numeric(month=="oct"),
         month_5=as.numeric(month=="apr"),
         month_6=as.numeric(month=="feb")) %>% 
  select(-month)
glimpse(all_data)

## Observations: 45,211
## Variables: 33
## $ age         <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default     <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ balance     <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing     <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,...
## $ loan        <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ day         <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ duration    <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign    <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays       <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous    <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ poutcome    <chr> "unknown", "unknown", "unknown", "unknown", "unkno...
## $ ID          <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y           <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data        <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1       <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2       <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced    <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec     <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert    <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ co_cellular <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,...
## $ co_tel      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ month_1     <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,...
## $ month_2     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_3     <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_4     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_5     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_6     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...

#Outcome
t=table(all_data$poutcome)
sort(t)

## 
## success   other failure unknown 
##    1511    1840    4901   36959

all_data=all_data %>% 
  mutate(poc_success=as.numeric(poutcome %in% c("success")),
         poc_failure=as.numeric(poutcome %in% c("failure")),
         poc_other=as.numeric(poutcome %in% c("other"))
  )%>% 
  select(-poutcome)
glimpse(all_data)

## Observations: 45,211
## Variables: 35
## $ age         <int> 45, 34, 40, 58, 59, 36, 34, 38, 52, 48, 28, 43, 37...
## $ default     <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ balance     <int> 2, 0, 311, 5810, 169, 24, -868, -140, 98, 1203, 38...
## $ housing     <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,...
## $ loan        <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ day         <int> 26, 10, 6, 12, 16, 11, 30, 25, 28, 6, 3, 19, 21, 2...
## $ duration    <int> 105, 268, 738, 139, 181, 100, 198, 456, 103, 61, 1...
## $ campaign    <int> 10, 1, 2, 1, 3, 5, 2, 3, 1, 5, 5, 1, 4, 1, 2, 4, 1...
## $ pdays       <int> -1, -1, -1, -1, -1, -1, -1, -1, 245, -1, -1, 198, ...
## $ previous    <int> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,...
## $ ID          <int> 22944, 13870, 19301, 31334, 3849, 10192, 12411, 12...
## $ y           <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", ...
## $ data        <chr> "train", "train", "train", "train", "train", "trai...
## $ job_1       <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ job_2       <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ job_3       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ job_4       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_5       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ job_6       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ divorced    <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ single      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,...
## $ edu_primary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ edu_sec     <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1,...
## $ edu_tert    <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ co_cellular <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,...
## $ co_tel      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ month_1     <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,...
## $ month_2     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_3     <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_4     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_5     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ month_6     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ poc_success <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ poc_failure <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ poc_other   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,...

With the predictors prepared, we now convert the response variable y from yes/no to 1/0.

all_data$y=as.numeric(all_data$y=="yes")
table(all_data$y)

## 
##     0     1 
## 27927  3720

#Next we take care of missing values, if any, in the data.
#Drop train rows with a missing response
all_data=all_data[!((is.na(all_data$y)) & all_data$data=='train'), ]

#Fill missing predictor values with the column mean computed on train rows only
for(col in names(all_data)){
  if(sum(is.na(all_data[,col]))>0 & !(col %in% c("data","y"))){
    all_data[is.na(all_data[,col]),col]=mean(all_data[all_data$data=='train',col],na.rm=T)
  }
}

sum(is.na(all_data$data=='train'))

## [1] 0
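
The loop above fills any missing predictor value with that column's mean computed on the training rows only, so no information from the test rows enters the fill value. A toy illustration, with a made-up data frame df and column balance:

```r
# Toy data frame (made up); 'data' marks the origin of each row, as in all_data.
df <- data.frame(
  balance = c(100, NA, 300, NA),
  data    = c("train", "train", "train", "test"),
  stringsAsFactors = FALSE
)

# Mean of the training rows only: (100 + 300) / 2 = 200
train_mean <- mean(df[df$data == "train", "balance"], na.rm = TRUE)

# Fill missing values in both train and test rows with the training mean
df[is.na(df$balance), "balance"] <- train_mean
print(df)
```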

Data preparation is now complete, so we separate the train and test data again.

train=all_data %>% 
  filter(data=='train') %>% 
  select(-data) #31647 rows, 34 columns

test=all_data %>% 
  filter(data=='test') %>% 
  select(-data,-y)

Step 3: Model Building

We will build the logistic regression model on train_75, a random 75% sample of train, and use test_25, the remaining 25%, to test the performance of the model thus built.

set.seed(5)
s=sample(1:nrow(train),0.75*nrow(train))
train_75=train[s,] #23735,34
test_25=train[-s,]#7912,34

#Find variables with VIF > 5 (a sign of multicollinearity)
library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

for_vif=lm(y~.,data=train_75)
summary(for_vif)

## 
## Call:
## lm(formula = y ~ ., data = train_75)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73998 -0.11938 -0.03431  0.03900  1.04734 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.645e-01  1.502e-02 -10.955  < 2e-16 ***
## age          4.555e-04  2.130e-04   2.138 0.032499 *  
## default      1.388e-03  1.301e-02   0.107 0.915086    
## balance      1.064e-06  5.623e-07   1.893 0.058378 .  
## housing     -3.364e-02  4.129e-03  -8.147 3.92e-16 ***
## loan        -1.477e-02  4.817e-03  -3.066 0.002170 ** 
## day          5.519e-04  2.235e-04   2.469 0.013550 *  
## duration     4.940e-04  6.874e-06  71.874  < 2e-16 ***
## campaign    -4.345e-04  5.931e-04  -0.733 0.463742    
## pdays       -6.411e-05  3.771e-05  -1.700 0.089095 .  
## previous     5.690e-04  7.702e-04   0.739 0.460063    
## ID           7.573e-06  2.548e-07  29.718  < 2e-16 ***
## job_1        5.941e-03  5.158e-03   1.152 0.249391    
## job_2       -1.916e-03  5.401e-03  -0.355 0.722799    
## job_3        1.312e-02  6.282e-03   2.089 0.036736 *  
## job_4        3.814e-02  1.315e-02   2.900 0.003738 ** 
## job_5        3.533e-02  9.541e-03   3.703 0.000214 ***
## job_6       -5.377e-03  1.082e-02  -0.497 0.619383    
## divorced     1.301e-02  5.634e-03   2.309 0.020963 *  
## single       2.035e-02  4.461e-03   4.562 5.10e-06 ***
## edu_primary -5.535e-03  9.893e-03  -0.560 0.575810    
## edu_sec      2.825e-03  9.135e-03   0.309 0.757090    
## edu_tert     9.105e-03  9.789e-03   0.930 0.352326    
## co_cellular -1.022e-01  6.779e-03 -15.073  < 2e-16 ***
## co_tel      -1.131e-01  9.408e-03 -12.021  < 2e-16 ***
## month_1      2.873e-02  4.896e-03   5.869 4.43e-09 ***
## month_2      9.915e-02  1.438e-02   6.894 5.55e-12 ***
## month_3      3.242e-01  1.765e-02  18.372  < 2e-16 ***
## month_4      1.803e-01  1.449e-02  12.442  < 2e-16 ***
## month_5      4.452e-02  7.883e-03   5.648 1.64e-08 ***
## month_6      3.337e-02  8.732e-03   3.821 0.000133 ***
## poc_success  3.540e-01  1.219e-02  29.048  < 2e-16 ***
## poc_failure -2.388e-02  1.112e-02  -2.148 0.031739 *  
## poc_other    2.657e-04  1.299e-02   0.020 0.983679    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2667 on 23701 degrees of freedom
## Multiple R-squared:  0.3258, Adjusted R-squared:  0.3248 
## F-statistic:   347 on 33 and 23701 DF,  p-value: < 2.2e-16

sort(vif(for_vif),decreasing = T)[1:3]

##  edu_sec edu_tert    pdays 
## 6.952724 6.607517 4.769001

#edu_sec has the highest VIF (> 5), so exclude it from the formula and recheck
for_vif=lm(y~.-edu_sec,data=train_75)
sort(vif(for_vif),decreasing = T)[1:3]

##       pdays poc_failure          ID 
##    4.768821    3.965875    3.701446
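
For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The response plays no part in it, which is why a plain `lm` on the 0/1 response is enough as a vehicle for the check. Below is a minimal base-R sketch on simulated data (the variables `x1`, `x2`, `x3` are made up for illustration; this is not the internal code of `car::vif`):

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # strongly correlated with x1
x3 <- rnorm(n)                 # unrelated to the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors
manual_vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(manual_vif, 2))
```

In this sketch `x1` and `x2` should show a large VIF because each is almost a copy of the other, while `x3` stays close to 1. Dropping one of a correlated pair, as done with edu_sec above, brings the partner's VIF down.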

summary(for_vif)

## 
## Call:
## lm(formula = y ~ . - edu_sec, data = train_75)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73976 -0.11941 -0.03441  0.03904  1.04751 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.617e-01  1.193e-02 -13.562  < 2e-16 ***
## age          4.491e-04  2.120e-04   2.118 0.034151 *  
## default      1.367e-03  1.301e-02   0.105 0.916368    
## balance      1.065e-06  5.623e-07   1.894 0.058287 .  
## housing     -3.361e-02  4.128e-03  -8.142 4.09e-16 ***
## loan        -1.470e-02  4.812e-03  -3.055 0.002252 ** 
## day          5.518e-04  2.235e-04   2.469 0.013564 *  
## duration     4.940e-04  6.874e-06  71.875  < 2e-16 ***
## campaign    -4.331e-04  5.930e-04  -0.730 0.465213    
## pdays       -6.418e-05  3.770e-05  -1.702 0.088726 .  
## previous     5.689e-04  7.702e-04   0.739 0.460100    
## ID           7.573e-06  2.548e-07  29.717  < 2e-16 ***
## job_1        5.937e-03  5.158e-03   1.151 0.249690    
## job_2       -1.901e-03  5.401e-03  -0.352 0.724917    
## job_3        1.306e-02  6.278e-03   2.080 0.037554 *  
## job_4        3.777e-02  1.310e-02   2.884 0.003935 ** 
## job_5        3.542e-02  9.537e-03   3.714 0.000204 ***
## job_6       -5.324e-03  1.082e-02  -0.492 0.622741    
## divorced     1.306e-02  5.632e-03   2.318 0.020434 *  
## single       2.032e-02  4.460e-03   4.556 5.23e-06 ***
## edu_primary -8.135e-03  5.217e-03  -1.559 0.118898    
## edu_tert     6.514e-03  5.064e-03   1.286 0.198365    
## co_cellular -1.021e-01  6.775e-03 -15.071  < 2e-16 ***
## co_tel      -1.131e-01  9.407e-03 -12.018  < 2e-16 ***
## month_1      2.873e-02  4.896e-03   5.869 4.45e-09 ***
## month_2      9.908e-02  1.438e-02   6.890 5.71e-12 ***
## month_3      3.242e-01  1.765e-02  18.371  < 2e-16 ***
## month_4      1.803e-01  1.449e-02  12.442  < 2e-16 ***
## month_5      4.453e-02  7.883e-03   5.649 1.63e-08 ***
## month_6      3.335e-02  8.732e-03   3.820 0.000134 ***
## poc_success  3.541e-01  1.219e-02  29.052  < 2e-16 ***
## poc_failure -2.385e-02  1.112e-02  -2.145 0.031923 *  
## poc_other    3.201e-04  1.299e-02   0.025 0.980335    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2667 on 23702 degrees of freedom
## Multiple R-squared:  0.3258, Adjusted R-squared:  0.3248 
## F-statistic: 357.9 on 32 and 23702 DF,  p-value: < 2.2e-16

Now let's build the logistic regression model on train_75.

#Build the model on train_75; always use family = "binomial" for logistic regression:

fit=glm(y~.,data=train_75, family = "binomial") #33 predictor variables (y~. still includes edu_sec)
summary(fit) #AIC is 10798.48; the lower the AIC, the better the model

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = train_75)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8435  -0.3500  -0.2119  -0.1152   3.2744  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.337e+00  2.387e-01 -26.550  < 2e-16 ***
## age          1.836e-03  3.056e-03   0.601 0.547894    
## default      2.645e-02  2.455e-01   0.108 0.914198    
## balance      1.292e-05  6.806e-06   1.898 0.057750 .  
## housing     -4.661e-01  6.120e-02  -7.616 2.61e-14 ***
## loan        -2.623e-01  8.336e-02  -3.147 0.001651 ** 
## day          5.016e-03  3.219e-03   1.558 0.119175    
## duration     4.644e-03  9.512e-05  48.828  < 2e-16 ***
## campaign    -5.297e-02  1.391e-02  -3.807 0.000141 ***
## pdays       -5.368e-05  4.011e-04  -0.134 0.893533    
## previous     4.628e-03  7.213e-03   0.642 0.521155    
## ID           1.025e-04  3.560e-06  28.783  < 2e-16 ***
## job_1        7.968e-02  7.810e-02   1.020 0.307607    
## job_2       -4.310e-02  8.704e-02  -0.495 0.620497    
## job_3        1.573e-01  8.991e-02   1.750 0.080182 .  
## job_4        1.051e-01  1.439e-01   0.730 0.465176    
## job_5        2.545e-01  1.269e-01   2.007 0.044801 *  
## job_6       -1.377e-01  1.508e-01  -0.913 0.360984    
## divorced     2.404e-01  8.274e-02   2.906 0.003661 ** 
## single       2.588e-01  6.507e-02   3.978 6.94e-05 ***
## edu_primary -1.090e-01  1.454e-01  -0.749 0.453666    
## edu_sec      9.410e-02  1.291e-01   0.729 0.466049    
## edu_tert     1.667e-01  1.357e-01   1.228 0.219420    
## co_cellular -8.584e-01  1.030e-01  -8.331  < 2e-16 ***
## co_tel      -1.037e+00  1.424e-01  -7.286 3.20e-13 ***
## month_1      5.589e-01  7.625e-02   7.329 2.32e-13 ***
## month_2      5.998e-01  1.384e-01   4.335 1.46e-05 ***
## month_3      2.263e+00  1.645e-01  13.755  < 2e-16 ***
## month_4      1.134e+00  1.392e-01   8.149 3.67e-16 ***
## month_5      7.157e-01  9.684e-02   7.390 1.46e-13 ***
## month_6      6.463e-01  1.141e-01   5.663 1.48e-08 ***
## poc_success  1.514e+00  1.153e-01  13.132  < 2e-16 ***
## poc_failure -3.830e-01  1.238e-01  -3.094 0.001975 ** 
## poc_other   -1.598e-01  1.441e-01  -1.109 0.267591    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17389  on 23734  degrees of freedom
## Residual deviance: 10730  on 23701  degrees of freedom
## AIC: 10798
## 
## Number of Fisher Scoring iterations: 6

All VIF values are under control. Now we look at the model summary, specifically the p-values associated with the variables. We can drop variables with high p-values (> 0.05) one by one, or we can use the step function, which drops variables one by one based on the AIC score. Although the methodologies differ, the end results are generally similar, because both target variables that contribute little to explaining the response.
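
To see what the AIC score measures: for a 0/1 response, the AIC of a logistic model equals the residual deviance plus 2 times the number of coefficients. In the summary above there are 34 coefficients (intercept plus 33 predictors) and a residual deviance of about 10730, giving 10730 + 68 = 10798. The sketch below checks this identity on simulated data and runs `step` on it (the `toy` data and the variables `x1`, `x2` are made up for illustration):

```r
set.seed(2)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)                              # pure noise, unrelated to y
y  <- rbinom(n, 1, plogis(-1 + 1.5 * x1))   # y depends on x1 only
toy <- data.frame(y, x1, x2)

full <- glm(y ~ x1 + x2, data = toy, family = "binomial")

# For a 0/1 response: AIC = residual deviance + 2 * (number of coefficients)
k <- length(coef(full))
print(c(AIC = AIC(full), deviance_plus_2k = deviance(full) + 2 * k))

# step() removes a term only when doing so lowers the AIC
reduced <- step(full, trace = 0)
print(formula(reduced))
```

Removing a variable saves 2 AIC points for the dropped coefficient, so `step` discards it whenever the deviance rises by less than 2. That is roughly why weak variables with high p-values tend to go first under both approaches.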

#Use step() to drop variables one by one based on AIC
fit=step(fit)

## Start:  AIC=10798.48
## y ~ age + default + balance + housing + loan + day + duration + 
##     campaign + pdays + previous + ID + job_1 + job_2 + job_3 + 
##     job_4 + job_5 + job_6 + divorced + single + edu_primary + 
##     edu_sec + edu_tert + co_cellular + co_tel + month_1 + month_2 + 
##     month_3 + month_4 + month_5 + month_6 + poc_success + poc_failure + 
##     poc_other
## 
##               Df Deviance   AIC
## - default      1    10730 10796
## - pdays        1    10730 10796
## - job_2        1    10731 10797
## - previous     1    10731 10797
## - age          1    10731 10797
## - job_4        1    10731 10797
## - edu_sec      1    10731 10797
## - edu_primary  1    10731 10797
## - job_6        1    10731 10797
## - job_1        1    10732 10798
## - poc_other    1    10732 10798
## - edu_tert     1    10732 10798
## <none>              10730 10798
## - day          1    10733 10799
## - job_3        1    10734 10800
## - balance      1    10734 10800
## - job_5        1    10734 10800
## - divorced     1    10739 10805
## - poc_failure  1    10740 10806
## - loan         1    10741 10807
## - single       1    10746 10812
## - campaign     1    10746 10812
## - month_2      1    10749 10815
## - month_6      1    10762 10828
## - month_5      1    10784 10850
## - co_tel       1    10785 10851
## - month_1      1    10785 10851
## - housing      1    10789 10855
## - month_4      1    10795 10861
## - co_cellular  1    10797 10863
## - poc_success  1    10906 10972
## - month_3      1    10911 10977
## - ID           1    11574 11640
## - duration     1    14015 14081
## 
## Step:  AIC=10796.49
## y ~ age + balance + housing + loan + day + duration + campaign + 
##     pdays + previous + ID + job_1 + job_2 + job_3 + job_4 + job_5 + 
##     job_6 + divorced + single + edu_primary + edu_sec + edu_tert + 
##     co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 + 
##     month_5 + month_6 + poc_success + poc_failure + poc_other
## 
##               Df Deviance   AIC
## - pdays        1    10730 10794
## - job_2        1    10731 10795
## - previous     1    10731 10795
## - age          1    10731 10795
## - job_4        1    10731 10795
## - edu_sec      1    10731 10795
## - edu_primary  1    10731 10795
## - job_6        1    10731 10795
## - job_1        1    10732 10796
## - poc_other    1    10732 10796
## - edu_tert     1    10732 10796
## <none>              10730 10796
## - day          1    10733 10797
## - job_3        1    10734 10798
## - balance      1    10734 10798
## - job_5        1    10734 10798
## - divorced     1    10739 10803
## - poc_failure  1    10740 10804
## - loan         1    10741 10805
## - campaign     1    10746 10810
## - single       1    10746 10810
## - month_2      1    10749 10813
## - month_6      1    10762 10826
## - month_5      1    10784 10848
## - co_tel       1    10785 10849
## - month_1      1    10785 10849
## - housing      1    10789 10853
## - month_4      1    10795 10859
## - co_cellular  1    10797 10861
## - poc_success  1    10906 10970
## - month_3      1    10911 10975
## - ID           1    11575 11639
## - duration     1    14016 14080
## 
## Step:  AIC=10794.5
## y ~ age + balance + housing + loan + day + duration + campaign + 
##     previous + ID + job_1 + job_2 + job_3 + job_4 + job_5 + job_6 + 
##     divorced + single + edu_primary + edu_sec + edu_tert + co_cellular + 
##     co_tel + month_1 + month_2 + month_3 + month_4 + month_5 + 
##     month_6 + poc_success + poc_failure + poc_other
## 
##               Df Deviance   AIC
## - job_2        1    10731 10793
## - previous     1    10731 10793
## - age          1    10731 10793
## - job_4        1    10731 10793
## - edu_sec      1    10731 10793
## - edu_primary  1    10731 10793
## - job_6        1    10731 10793
## - job_1        1    10732 10794
## - edu_tert     1    10732 10794
## <none>              10730 10794
## - poc_other    1    10733 10795
## - day          1    10733 10795
## - job_3        1    10734 10796
## - balance      1    10734 10796
## - job_5        1    10734 10796
## - divorced     1    10739 10801
## - loan         1    10741 10803
## - single       1    10746 10808
## - campaign     1    10746 10808
## - month_2      1    10749 10811
## - poc_failure  1    10753 10815
## - month_6      1    10762 10824
## - month_5      1    10784 10846
## - co_tel       1    10785 10847
## - month_1      1    10786 10848
## - housing      1    10790 10852
## - month_4      1    10796 10858
## - co_cellular  1    10797 10859
## - month_3      1    10912 10974
## - poc_success  1    10976 11038
## - ID           1    11576 11638
## - duration     1    14016 14078
## 
## Step:  AIC=10792.75
## y ~ age + balance + housing + loan + day + duration + campaign + 
##     previous + ID + job_1 + job_3 + job_4 + job_5 + job_6 + divorced + 
##     single + edu_primary + edu_sec + edu_tert + co_cellular + 
##     co_tel + month_1 + month_2 + month_3 + month_4 + month_5 + 
##     month_6 + poc_success + poc_failure + poc_other
## 
##               Df Deviance   AIC
## - age          1    10731 10791
## - previous     1    10731 10791
## - edu_sec      1    10731 10791
## - edu_primary  1    10731 10791
## - job_4        1    10731 10791
## - job_6        1    10731 10791
## - edu_tert     1    10732 10792
## - job_1        1    10732 10792
## <none>              10731 10793
## - poc_other    1    10733 10793
## - day          1    10733 10793
## - balance      1    10734 10794
## - job_3        1    10735 10795
## - job_5        1    10736 10796
## - divorced     1    10739 10799
## - loan         1    10741 10801
## - single       1    10746 10806
## - campaign     1    10746 10806
## - month_2      1    10749 10809
## - poc_failure  1    10754 10814
## - month_6      1    10762 10822
## - month_5      1    10784 10844
## - co_tel       1    10785 10845
## - month_1      1    10786 10846
## - housing      1    10790 10850
## - month_4      1    10796 10856
## - co_cellular  1    10797 10857
## - month_3      1    10912 10972
## - poc_success  1    10976 11036
## - ID           1    11577 11637
## - duration     1    14016 14076
## 
## Step:  AIC=10791.07
## y ~ balance + housing + loan + day + duration + campaign + previous + 
##     ID + job_1 + job_3 + job_4 + job_5 + job_6 + divorced + single + 
##     edu_primary + edu_sec + edu_tert + co_cellular + co_tel + 
##     month_1 + month_2 + month_3 + month_4 + month_5 + month_6 + 
##     poc_success + poc_failure + poc_other
## 
##               Df Deviance   AIC
## - previous     1    10731 10789
## - edu_sec      1    10732 10790
## - job_4        1    10732 10790
## - edu_primary  1    10732 10790
## - job_6        1    10732 10790
## - edu_tert     1    10732 10790
## - job_1        1    10733 10791
## <none>              10731 10791
## - poc_other    1    10733 10791
## - day          1    10734 10792
## - balance      1    10735 10793
## - job_3        1    10736 10794
## - job_5        1    10739 10797
## - divorced     1    10740 10798
## - loan         1    10741 10799
## - campaign     1    10747 10805
## - single       1    10748 10806
## - month_2      1    10750 10808
## - poc_failure  1    10754 10812
## - month_6      1    10762 10820
## - month_5      1    10784 10842
## - co_tel       1    10785 10843
## - month_1      1    10787 10845
## - housing      1    10792 10850
## - month_4      1    10797 10855
## - co_cellular  1    10797 10855
## - month_3      1    10913 10971
## - poc_success  1    10977 11035
## - ID           1    11579 11637
## - duration     1    14017 14075
## 
## Step:  AIC=10789.43
## y ~ balance + housing + loan + day + duration + campaign + ID + 
##     job_1 + job_3 + job_4 + job_5 + job_6 + divorced + single + 
##     edu_primary + edu_sec + edu_tert + co_cellular + co_tel + 
##     month_1 + month_2 + month_3 + month_4 + month_5 + month_6 + 
##     poc_success + poc_failure + poc_other
## 
##               Df Deviance   AIC
## - edu_sec      1    10732 10788
## - job_4        1    10732 10788
## - edu_primary  1    10732 10788
## - job_6        1    10732 10788
## - edu_tert     1    10733 10789
## - job_1        1    10733 10789
## <none>              10731 10789
## - poc_other    1    10734 10790
## - day          1    10734 10790
## - balance      1    10735 10791
## - job_3        1    10736 10792
## - job_5        1    10740 10796
## - divorced     1    10740 10796
## - loan         1    10742 10798
## - campaign     1    10747 10803
## - single       1    10748 10804
## - month_2      1    10750 10806
## - poc_failure  1    10755 10811
## - month_6      1    10763 10819
## - month_5      1    10785 10841
## - co_tel       1    10786 10842
## - month_1      1    10787 10843
## - housing      1    10792 10848
## - month_4      1    10797 10853
## - co_cellular  1    10798 10854
## - month_3      1    10913 10969
## - poc_success  1    11002 11058
## - ID           1    11580 11636
## - duration     1    14018 14074
## 
## Step:  AIC=10787.89
## y ~ balance + housing + loan + day + duration + campaign + ID + 
##     job_1 + job_3 + job_4 + job_5 + job_6 + divorced + single + 
##     edu_primary + edu_tert + co_cellular + co_tel + month_1 + 
##     month_2 + month_3 + month_4 + month_5 + month_6 + poc_success + 
##     poc_failure + poc_other
## 
##               Df Deviance   AIC
## - job_4        1    10732 10786
## - job_6        1    10733 10787
## - edu_tert     1    10733 10787
## - job_1        1    10734 10788
## - poc_other    1    10734 10788
## <none>              10732 10788
## - day          1    10734 10788
## - balance      1    10736 10790
## - job_3        1    10736 10790
## - edu_primary  1    10737 10791
## - job_5        1    10740 10794
## - divorced     1    10740 10794
## - loan         1    10742 10796
## - campaign     1    10747 10801
## - single       1    10749 10803
## - month_2      1    10750 10804
## - poc_failure  1    10756 10810
## - month_6      1    10763 10817
## - month_5      1    10785 10839
## - co_tel       1    10786 10840
## - month_1      1    10788 10842
## - housing      1    10792 10846
## - month_4      1    10798 10852
## - co_cellular  1    10798 10852
## - month_3      1    10914 10968
## - poc_success  1    11002 11056
## - ID           1    11580 11634
## - duration     1    14018 14072
## 
## Step:  AIC=10786.33
## y ~ balance + housing + loan + day + duration + campaign + ID + 
##     job_1 + job_3 + job_5 + job_6 + divorced + single + edu_primary + 
##     edu_tert + co_cellular + co_tel + month_1 + month_2 + month_3 + 
##     month_4 + month_5 + month_6 + poc_success + poc_failure + 
##     poc_other
## 
##               Df Deviance   AIC
## - job_6        1    10733 10785
## - edu_tert     1    10734 10786
## - job_1        1    10734 10786
## - poc_other    1    10734 10786
## <none>              10732 10786
## - day          1    10735 10787
## - balance      1    10736 10788
## - job_3        1    10736 10788
## - edu_primary  1    10737 10789
## - job_5        1    10740 10792
## - divorced     1    10741 10793
## - loan         1    10743 10795
## - campaign     1    10748 10800
## - month_2      1    10751 10803
## - single       1    10751 10803
## - poc_failure  1    10756 10808
## - month_6      1    10764 10816
## - month_5      1    10786 10838
## - co_tel       1    10786 10838
## - month_1      1    10788 10840
## - housing      1    10795 10847
## - month_4      1    10799 10851
## - co_cellular  1    10799 10851
## - month_3      1    10915 10967
## - poc_success  1    11003 11055
## - ID           1    11590 11642
## - duration     1    14018 14070
## 
## Step:  AIC=10785.15
## y ~ balance + housing + loan + day + duration + campaign + ID + 
##     job_1 + job_3 + job_5 + divorced + single + edu_primary + 
##     edu_tert + co_cellular + co_tel + month_1 + month_2 + month_3 + 
##     month_4 + month_5 + month_6 + poc_success + poc_failure + 
##     poc_other
## 
##               Df Deviance   AIC
## - edu_tert     1    10734 10784
## - poc_other    1    10735 10785
## - job_1        1    10735 10785
## <none>              10733 10785
## - day          1    10736 10786
## - balance      1    10737 10787
## - job_3        1    10738 10788
## - edu_primary  1    10738 10788
## - job_5        1    10742 10792
## - divorced     1    10742 10792
## - loan         1    10743 10793
## - campaign     1    10749 10799
## - month_2      1    10752 10802
## - single       1    10752 10802
## - poc_failure  1    10757 10807
## - month_6      1    10764 10814
## - month_5      1    10787 10837
## - co_tel       1    10787 10837
## - month_1      1    10789 10839
## - housing      1    10795 10845
## - month_4      1    10799 10849
## - co_cellular  1    10800 10850
## - month_3      1    10916 10966
## - poc_success  1    11003 11053
## - ID           1    11590 11640
## - duration     1    14018 14068
## 
## Step:  AIC=10784.24
## y ~ balance + housing + loan + day + duration + campaign + ID + 
##     job_1 + job_3 + job_5 + divorced + single + edu_primary + 
##     co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 + 
##     month_5 + month_6 + poc_success + poc_failure + poc_other
## 
##               Df Deviance   AIC
## - poc_other    1    10736 10784
## <none>              10734 10784
## - day          1    10737 10785
## - job_1        1    10737 10785
## - balance      1    10738 10786
## - edu_primary  1    10740 10788
## - divorced     1    10743 10791
## - job_5        1    10743 10791
## - loan         1    10745 10793
## - job_3        1    10746 10794
## - campaign     1    10750 10798
## - month_2      1    10753 10801
## - single       1    10754 10802
## - poc_failure  1    10758 10806
## - month_6      1    10765 10813
## - month_5      1    10788 10836
## - co_tel       1    10789 10837
## - month_1      1    10790 10838
## - housing      1    10796 10844
## - co_cellular  1    10800 10848
## - month_4      1    10801 10849
## - month_3      1    10918 10966
## - poc_success  1    11005 11053
## - ID           1    11594 11642
## - duration     1    14018 14066
## 
## Step:  AIC=10784.19
## y ~ balance + housing + loan + day + duration + campaign + ID + 
##     job_1 + job_3 + job_5 + divorced + single + edu_primary + 
##     co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 + 
##     month_5 + month_6 + poc_success + poc_failure
## 
##               Df Deviance   AIC
## <none>              10736 10784
## - day          1    10738 10784
## - job_1        1    10739 10785
## - balance      1    10740 10786
## - edu_primary  1    10742 10788
## - divorced     1    10745 10791
## - job_5        1    10745 10791
## - loan         1    10747 10793
## - job_3        1    10748 10794
## - campaign     1    10752 10798
## - month_2      1    10754 10800
## - single       1    10756 10802
## - poc_failure  1    10758 10804
## - month_6      1    10767 10813
## - month_5      1    10790 10836
## - co_tel       1    10790 10836
## - month_1      1    10792 10838
## - housing      1    10801 10847
## - co_cellular  1    10802 10848
## - month_4      1    10803 10849
## - month_3      1    10919 10965
## - poc_success  1    11023 11069
## - ID           1    11614 11660
## - duration     1    14020 14066

summary(fit)

## 
## Call:
## glm(formula = y ~ balance + housing + loan + day + duration + 
##     campaign + ID + job_1 + job_3 + job_5 + divorced + single + 
##     edu_primary + co_cellular + co_tel + month_1 + month_2 + 
##     month_3 + month_4 + month_5 + month_6 + poc_success + poc_failure, 
##     family = "binomial", data = train_75)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8432  -0.3511  -0.2118  -0.1153   3.2759  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.173e+00  1.509e-01 -40.905  < 2e-16 ***
## balance      1.340e-05  6.783e-06   1.976 0.048156 *  
## housing     -4.778e-01  5.975e-02  -7.996 1.29e-15 ***
## loan        -2.631e-01  8.294e-02  -3.173 0.001510 ** 
## day          4.830e-03  3.215e-03   1.502 0.133033    
## duration     4.639e-03  9.499e-05  48.831  < 2e-16 ***
## campaign    -5.233e-02  1.388e-02  -3.770 0.000163 ***
## ID           1.018e-04  3.471e-06  29.322  < 2e-16 ***
## job_1        1.101e-01  6.854e-02   1.606 0.108177    
## job_3        2.267e-01  6.556e-02   3.458 0.000543 ***
## job_5        3.180e-01  1.041e-01   3.053 0.002265 ** 
## divorced     2.439e-01  8.239e-02   2.960 0.003072 ** 
## single       2.590e-01  5.723e-02   4.526 6.00e-06 ***
## edu_primary -2.053e-01  8.384e-02  -2.449 0.014316 *  
## co_cellular -8.516e-01  1.027e-01  -8.291  < 2e-16 ***
## co_tel      -1.026e+00  1.411e-01  -7.269 3.62e-13 ***
## month_1      5.612e-01  7.585e-02   7.399 1.37e-13 ***
## month_2      5.963e-01  1.380e-01   4.320 1.56e-05 ***
## month_3      2.270e+00  1.641e-01  13.834  < 2e-16 ***
## month_4      1.145e+00  1.388e-01   8.254  < 2e-16 ***
## month_5      7.157e-01  9.677e-02   7.397 1.40e-13 ***
## month_6      6.356e-01  1.131e-01   5.621 1.89e-08 ***
## poc_success  1.545e+00  9.288e-02  16.633  < 2e-16 ***
## poc_failure -3.617e-01  7.798e-02  -4.638 3.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17389  on 23734  degrees of freedom
## Residual deviance: 10736  on 23711  degrees of freedom
## AIC: 10784
## 
## Number of Fisher Scoring iterations: 6

#Get the formula chosen by step(); next we drop variables based on p-values
formula(fit)

## y ~ balance + housing + loan + day + duration + campaign + ID + 
##     job_1 + job_3 + job_5 + divorced + single + edu_primary + 
##     co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 + 
##     month_5 + month_6 + poc_success + poc_failure

#Let's check the variables that remain after step()
names(fit$coefficients)

##  [1] "(Intercept)" "balance"     "housing"     "loan"        "day"        
##  [6] "duration"    "campaign"    "ID"          "job_1"       "job_3"      
## [11] "job_5"       "divorced"    "single"      "edu_primary" "co_cellular"
## [16] "co_tel"      "month_1"     "month_2"     "month_3"     "month_4"    
## [21] "month_5"     "month_6"     "poc_success" "poc_failure"

#Let's build the final logistic model on train_75, dropping day and job_1 (p > 0.05)
fit_final=glm(y ~ balance + housing + loan + duration + campaign + ID + 
                job_3 + job_5 + divorced + single + edu_primary + 
                co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 + 
                month_5 + month_6 + poc_success + poc_failure,
              data=train_75,family="binomial")

summary(fit_final)

## 
## Call:
## glm(formula = y ~ balance + housing + loan + duration + campaign + 
##     ID + job_3 + job_5 + divorced + single + edu_primary + co_cellular + 
##     co_tel + month_1 + month_2 + month_3 + month_4 + month_5 + 
##     month_6 + poc_success + poc_failure, family = "binomial", 
##     data = train_75)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8400  -0.3506  -0.2118  -0.1152   3.2747  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.063e+00  1.406e-01 -43.133  < 2e-16 ***
## balance      1.360e-05  6.765e-06   2.010  0.04444 *  
## housing     -4.867e-01  5.962e-02  -8.163 3.28e-16 ***
## loan        -2.664e-01  8.289e-02  -3.214  0.00131 ** 
## duration     4.633e-03  9.493e-05  48.810  < 2e-16 ***
## campaign    -5.025e-02  1.383e-02  -3.633  0.00028 ***
## ID           1.010e-04  3.449e-06  29.298  < 2e-16 ***
## job_3        1.908e-01  6.148e-02   3.103  0.00192 ** 
## job_5        2.922e-01  1.025e-01   2.851  0.00436 ** 
## divorced     2.467e-01  8.237e-02   2.995  0.00275 ** 
## single       2.621e-01  5.717e-02   4.585 4.54e-06 ***
## edu_primary -2.274e-01  8.249e-02  -2.756  0.00585 ** 
## co_cellular -8.305e-01  1.022e-01  -8.130 4.28e-16 ***
## co_tel      -1.009e+00  1.408e-01  -7.166 7.72e-13 ***
## month_1      5.700e-01  7.577e-02   7.523 5.37e-14 ***
## month_2      5.917e-01  1.379e-01   4.292 1.77e-05 ***
## month_3      2.266e+00  1.638e-01  13.836  < 2e-16 ***
## month_4      1.167e+00  1.383e-01   8.437  < 2e-16 ***
## month_5      7.328e-01  9.600e-02   7.633 2.30e-14 ***
## month_6      5.994e-01  1.099e-01   5.453 4.95e-08 ***
## poc_success  1.548e+00  9.286e-02  16.676  < 2e-16 ***
## poc_failure -3.620e-01  7.792e-02  -4.646 3.38e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17389  on 23734  degrees of freedom
## Residual deviance: 10741  on 23713  degrees of freedom
## AIC: 10785
## 
## Number of Fisher Scoring iterations: 6

Thus the logistic regression model is successfully built.

Now let's predict scores on test_25.

library(pROC)

score=predict(fit_final,newdata =test_25,type = "response")

Check the performance using the AUC score.

#Thus area under the ROC curve is:
roccurve=roc(test_25$y,score) #real outcome and predicted score
auc(roccurve)

## Area under the curve: 0.9212

The area under the curve is 0.9212. The higher the AUC, the better the model separates the two classes.
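
One way to read the AUC: it is the probability that a randomly chosen customer with outcome 1 receives a higher score than a randomly chosen customer with outcome 0. The sketch below computes it in base R in two ways on a tiny made-up example (the `actual` and `score` vectors are illustrative assumptions; this is not how `pROC` computes it internally):

```r
# Made-up outcomes and scores for 8 customers (illustrative only)
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
score  <- c(0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90)

n1 <- sum(actual == 1)  # number of positives
n0 <- sum(actual == 0)  # number of negatives

# Rank-based (Mann-Whitney) formula; rank() averages tied scores
r <- rank(score)
auc_rank <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Direct check: share of (positive, negative) pairs where the positive scores higher
auc_pairs <- mean(outer(score[actual == 1], score[actual == 0], ">"))

print(c(auc_rank = auc_rank, auc_pairs = auc_pairs))
```

Here 15 of the 16 positive/negative pairs are ordered correctly, so both calculations give 0.9375.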

The modelled probability is P(y=1) by default. This means the score should be high when the outcome is 1 and low when the outcome is 0.

Let's visualise how the actual binary response behaves with respect to the score that we obtained.

library(ggplot2)
mydata=data.frame(Actual=test_25$y,Predicted=score)
#geom_jitter() already draws the points, so a separate geom_point() is not needed
ggplot(mydata,aes(y=Actual,x=Predicted,color=factor(Actual)))+
  geom_jitter()

You can see that response 0 is bunched around low scores and response 1 is bunched around high scores. However, the two groups overlap across score values, so we need to find a cutoff on this score if we want to predict hard classes.
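
Since the evaluation criterion is the KS score, one common way to choose such a cutoff is to pick the score at which KS, the gap between the true positive rate and the false positive rate, is largest. The following is a minimal sketch on a tiny made-up example (the `actual` and `score` vectors and the `ks_table` name are illustrative assumptions, not the cutoff search used on the bank data):

```r
# Made-up outcomes and scores for 8 customers (illustrative only)
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
score  <- c(0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90)

# Try every distinct score as a cutoff: predict 1 when score >= cutoff
cutoffs  <- sort(unique(score))
ks_table <- data.frame(cutoff = cutoffs, TPR = NA_real_, FPR = NA_real_)

for (i in seq_along(cutoffs)) {
  pred <- as.numeric(score >= cutoffs[i])
  ks_table$TPR[i] <- sum(pred == 1 & actual == 1) / sum(actual == 1)
  ks_table$FPR[i] <- sum(pred == 1 & actual == 0) / sum(actual == 0)
}

# KS at each cutoff is the gap between the two rates
ks_table$KS <- ks_table$TPR - ks_table$FPR

# which.max() returns the first cutoff that reaches the largest KS
best <- ks_table[which.max(ks_table$KS), ]
print(best)
```

On real data the same loop would run over the scores from test_25, and the chosen cutoff would then be applied to the scores predicted for the test file.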

Let's build the model on the entire train data.

# so the tentative performance (AUC) of logistic regression is going to be around 0.9212
# now let's build the model on the entire training data

library(car)
for_vif_final=lm(y~.,data=train)
summary(for_vif_final)

## 
## Call:
## lm(formula = y ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.06429 -0.11770 -0.03423  0.03642  1.04121 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.505e-01  1.283e-02 -11.729  < 2e-16 ***
## age          3.015e-04  1.837e-04   1.641 0.100800    
## default      4.211e-03  1.143e-02   0.368 0.712674    
## balance      1.054e-06  5.003e-07   2.106 0.035204 *  
## housing     -3.438e-02  3.555e-03  -9.672  < 2e-16 ***
## loan        -1.291e-02  4.153e-03  -3.110 0.001874 ** 
## day          4.884e-04  1.926e-04   2.536 0.011233 *  
## duration     4.797e-04  5.887e-06  81.488  < 2e-16 ***
## campaign    -2.866e-04  5.015e-04  -0.571 0.567727    
## pdays       -7.456e-05  3.220e-05  -2.316 0.020582 *  
## previous     4.162e-04  7.179e-04   0.580 0.562084    
## ID           7.404e-06  2.186e-07  33.867  < 2e-16 ***
## job_1        3.799e-03  4.439e-03   0.856 0.392043    
## job_2       -4.361e-03  4.689e-03  -0.930 0.352264    
## job_3        1.029e-02  5.398e-03   1.907 0.056592 .  
## job_4        5.558e-02  1.129e-02   4.924 8.52e-07 ***
## job_5        2.802e-02  8.136e-03   3.444 0.000574 ***
## job_6       -4.295e-03  9.344e-03  -0.460 0.645792    
## divorced     1.394e-02  4.858e-03   2.870 0.004103 ** 
## single       1.722e-02  3.834e-03   4.492 7.08e-06 ***
## edu_primary -6.400e-03  8.419e-03  -0.760 0.447196    
## edu_sec      4.772e-03  7.748e-03   0.616 0.538009    
## edu_tert     1.072e-02  8.298e-03   1.292 0.196520    
## co_cellular -9.688e-02  5.808e-03 -16.682  < 2e-16 ***
## co_tel      -1.092e-01  8.066e-03 -13.535  < 2e-16 ***
## month_1      2.368e-02  4.220e-03   5.610 2.04e-08 ***
## month_2      1.007e-01  1.240e-02   8.115 5.03e-16 ***
## month_3      3.088e-01  1.525e-02  20.248  < 2e-16 ***
## month_4      1.630e-01  1.257e-02  12.967  < 2e-16 ***
## month_5      3.982e-02  6.821e-03   5.837 5.35e-09 ***
## month_6      3.573e-02  7.516e-03   4.754 2.00e-06 ***
## poc_success  3.651e-01  1.063e-02  34.353  < 2e-16 ***
## poc_failure -2.516e-02  9.567e-03  -2.630 0.008552 ** 
## poc_other   -7.933e-03  1.110e-02  -0.715 0.474773    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2656 on 31613 degrees of freedom
## Multiple R-squared:  0.3205, Adjusted R-squared:  0.3198 
## F-statistic: 451.9 on 33 and 31613 DF,  p-value: < 2.2e-16

sort(vif(for_vif_final),decreasing = T)[1:3]

##  edu_sec edu_tert    pdays 
## 6.725263 6.376653 4.658624

#edu_sec has the highest VIF, so leave it out of the formula
for_vif_final=lm(y~.-edu_sec,data=train)
sort(vif(for_vif_final),decreasing = T)[1:3]

##       pdays poc_failure          ID 
##    4.658565    3.929525    3.659703
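As a reminder of what vif() reports for a linear model without factor terms: the VIF of a predictor is 1/(1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal sketch on simulated toy data (x1, x2 and x3 are hypothetical columns, not part of train):

```r
# Toy data (simulated): x3 is nearly a linear combination of x1 and x2
set.seed(42)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200, sd = 0.3)
toy <- data.frame(x1, x2, x3)

# For each column: regress it on the remaining columns, then 1 / (1 - R^2)
vif_manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

sort(vif_manual, decreasing = TRUE)
```

A large value means the predictor is almost fully explained by the others, which is why the variable with the highest VIF is the first candidate to drop.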

summary(for_vif_final)

## 
## Call:
## lm(formula = y ~ . - edu_sec, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.06369 -0.11762 -0.03423  0.03654  1.04150 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.458e-01  1.027e-02 -14.198  < 2e-16 ***
## age          2.911e-04  1.829e-04   1.591 0.111552    
## default      4.193e-03  1.143e-02   0.367 0.713794    
## balance      1.054e-06  5.003e-07   2.107 0.035125 *  
## housing     -3.431e-02  3.553e-03  -9.656  < 2e-16 ***
## loan        -1.279e-02  4.148e-03  -3.083 0.002048 ** 
## day          4.882e-04  1.926e-04   2.534 0.011271 *  
## duration     4.797e-04  5.887e-06  81.488  < 2e-16 ***
## campaign    -2.863e-04  5.015e-04  -0.571 0.568116    
## pdays       -7.464e-05  3.220e-05  -2.318 0.020460 *  
## previous     4.169e-04  7.179e-04   0.581 0.561378    
## ID           7.402e-06  2.186e-07  33.863  < 2e-16 ***
## job_1        3.789e-03  4.439e-03   0.854 0.393337    
## job_2       -4.341e-03  4.688e-03  -0.926 0.354497    
## job_3        1.016e-02  5.393e-03   1.884 0.059612 .  
## job_4        5.490e-02  1.123e-02   4.887 1.03e-06 ***
## job_5        2.817e-02  8.132e-03   3.464 0.000532 ***
## job_6       -4.206e-03  9.343e-03  -0.450 0.652607    
## divorced     1.404e-02  4.855e-03   2.891 0.003844 ** 
## single       1.719e-02  3.834e-03   4.484 7.35e-06 ***
## edu_primary -1.078e-02  4.507e-03  -2.391 0.016789 *  
## edu_tert     6.368e-03  4.357e-03   1.462 0.143829    
## co_cellular -9.676e-02  5.804e-03 -16.671  < 2e-16 ***
## co_tel      -1.091e-01  8.066e-03 -13.528  < 2e-16 ***
## month_1      2.367e-02  4.220e-03   5.609 2.05e-08 ***
## month_2      1.006e-01  1.240e-02   8.109 5.31e-16 ***
## month_3      3.088e-01  1.525e-02  20.246  < 2e-16 ***
## month_4      1.630e-01  1.257e-02  12.964  < 2e-16 ***
## month_5      3.982e-02  6.821e-03   5.837 5.36e-09 ***
## month_6      3.572e-02  7.516e-03   4.752 2.02e-06 ***
## poc_success  3.651e-01  1.063e-02  34.356  < 2e-16 ***
## poc_failure -2.513e-02  9.567e-03  -2.627 0.008626 ** 
## poc_other   -7.876e-03  1.110e-02  -0.710 0.477961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2656 on 31614 degrees of freedom
## Multiple R-squared:  0.3205, Adjusted R-squared:  0.3198 
## F-statistic:   466 on 32 and 31614 DF,  p-value: < 2.2e-16

#Build the model
fit_final_model=glm(y~.,data=train, family = "binomial") #y~. keeps all 33 predictors, edu_sec included
summary(fit_final_model)

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.7192  -0.3504  -0.2138  -0.1200   3.2289  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.118e+00  2.035e-01 -30.056  < 2e-16 ***
## age          2.818e-04  2.655e-03   0.106 0.915465    
## default      8.571e-02  2.144e-01   0.400 0.689298    
## balance      1.257e-05  6.194e-06   2.029 0.042491 *  
## housing     -4.805e-01  5.325e-02  -9.024  < 2e-16 ***
## loan        -2.281e-01  7.189e-02  -3.174 0.001506 ** 
## day          5.077e-03  2.797e-03   1.815 0.069566 .  
## duration     4.533e-03  8.163e-05  55.532  < 2e-16 ***
## campaign    -5.061e-02  1.201e-02  -4.214 2.51e-05 ***
## pdays       -2.162e-04  3.476e-04  -0.622 0.534060    
## previous     3.051e-03  7.613e-03   0.401 0.688618    
## ID           1.008e-04  3.054e-06  32.999  < 2e-16 ***
## job_1        4.875e-02  6.734e-02   0.724 0.469078    
## job_2       -1.003e-01  7.676e-02  -1.306 0.191408    
## job_3        1.323e-01  7.764e-02   1.704 0.088465 .  
## job_4        2.205e-01  1.224e-01   1.802 0.071521 .  
## job_5        1.939e-01  1.100e-01   1.762 0.078006 .  
## job_6       -1.227e-01  1.309e-01  -0.937 0.348648    
## divorced     2.545e-01  7.199e-02   3.535 0.000408 ***
## single       2.187e-01  5.618e-02   3.892 9.94e-05 ***
## edu_primary -1.735e-01  1.246e-01  -1.393 0.163677    
## edu_sec      1.073e-01  1.094e-01   0.981 0.326478    
## edu_tert     1.661e-01  1.148e-01   1.447 0.147832    
## co_cellular -8.475e-01  8.822e-02  -9.606  < 2e-16 ***
## co_tel      -1.047e+00  1.227e-01  -8.534  < 2e-16 ***
## month_1      4.837e-01  6.624e-02   7.302 2.83e-13 ***
## month_2      5.894e-01  1.197e-01   4.923 8.54e-07 ***
## month_3      2.152e+00  1.420e-01  15.154  < 2e-16 ***
## month_4      1.016e+00  1.217e-01   8.345  < 2e-16 ***
## month_5      6.530e-01  8.433e-02   7.743 9.71e-15 ***
## month_6      6.478e-01  9.789e-02   6.617 3.66e-11 ***
## poc_success  1.560e+00  1.019e-01  15.318  < 2e-16 ***
## poc_failure -3.775e-01  1.087e-01  -3.474 0.000512 ***
## poc_other   -2.221e-01  1.256e-01  -1.768 0.077107 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22913  on 31646  degrees of freedom
## Residual deviance: 14281  on 31613  degrees of freedom
## AIC: 14349
## 
## Number of Fisher Scoring iterations: 6

#Remove variables having p>0.05 one by one
#fit=step(fit_final_model)

#let's start dropping variables based on p-values
#formula(fit_final_model)

#Now, based on the summary results, leave out (i.e. don't add) the variables having p-value >0.05.

fit_final_model=glm(y ~ balance + housing + loan + duration + 
                      campaign + pdays + ID + job_3 + 
                       job_5 + divorced + single + edu_primary + 
                      co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 + 
                      month_5 + month_6 + poc_success + poc_failure ,
                data=train,family="binomial")

summary(fit_final_model)

## 
## Call:
## glm(formula = y ~ balance + housing + loan + duration + campaign + 
##     pdays + ID + job_3 + job_5 + divorced + single + edu_primary + 
##     co_cellular + co_tel + month_1 + month_2 + month_3 + month_4 + 
##     month_5 + month_6 + poc_success + poc_failure, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.7197  -0.3519  -0.2140  -0.1196   3.2387  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.920e+00  1.203e-01 -49.226  < 2e-16 ***
## balance      1.273e-05  6.159e-06   2.067 0.038769 *  
## housing     -4.930e-01  5.225e-02  -9.436  < 2e-16 ***
## loan        -2.309e-01  7.146e-02  -3.232 0.001231 ** 
## duration     4.521e-03  8.147e-05  55.498  < 2e-16 ***
## campaign    -4.789e-02  1.195e-02  -4.008 6.11e-05 ***
## pdays       -6.003e-04  2.726e-04  -2.202 0.027661 *  
## ID           1.001e-04  3.000e-06  33.362  < 2e-16 ***
## job_3        1.649e-01  5.340e-02   3.087 0.002019 ** 
## job_5        2.055e-01  8.914e-02   2.306 0.021115 *  
## divorced     2.579e-01  7.169e-02   3.597 0.000322 ***
## single       2.503e-01  4.948e-02   5.059 4.20e-07 ***
## edu_primary -3.009e-01  7.264e-02  -4.143 3.43e-05 ***
## co_cellular -8.296e-01  8.774e-02  -9.455  < 2e-16 ***
## co_tel      -1.034e+00  1.216e-01  -8.506  < 2e-16 ***
## month_1      4.856e-01  6.593e-02   7.365 1.77e-13 ***
## month_2      5.814e-01  1.195e-01   4.866 1.14e-06 ***
## month_3      2.151e+00  1.416e-01  15.189  < 2e-16 ***
## month_4      1.032e+00  1.209e-01   8.531  < 2e-16 ***
## month_5      6.687e-01  8.358e-02   8.000 1.24e-15 ***
## month_6      5.928e-01  9.448e-02   6.274 3.51e-10 ***
## poc_success  1.653e+00  8.839e-02  18.696  < 2e-16 ***
## poc_failure -2.662e-01  8.724e-02  -3.051 0.002280 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22913  on 31646  degrees of freedom
## Residual deviance: 14298  on 31624  degrees of freedom
## AIC: 14344
## 
## Number of Fisher Scoring iterations: 6

If we only needed to submit probability scores for the test data, we would be done at this point. We pass type = "response" to predict() to get probability scores.

test.prob.score= predict(fit_final_model,newdata = test,type='response')

write.csv(test.prob.score,"Neha_Raut_Probabilities.csv",row.names = F)

4. Finding Cutoff value and Performance measurements of the model.

Let's find a cutoff based on these probability scores.

You can find the cutoff either with the prediction and performance objects from ROCR or with the KS statistic.

1. Using prediction and performance parameters

train.score=predict(fit_final_model,newdata = train,type="response")
real=train$y

#Deciding cutoff
library(ROCR)
#We will need two objects: prediction and performance
ROCRPred=prediction(train.score,train$y)
ROCRPref=performance(ROCRPred,"tpr","fpr") #true positive rate vs false positive rate

plot(ROCRPref,colorize=TRUE,print.cutoffs.at=seq(0.1,1,by=0.1))  #cutoffs are labelled every 0.1; the best one comes out near 0.1

#OR
#plot(ROCRPref,colorize=TRUE,print.cutoffs.at=seq(0.1,1,by=100))

#OR

res.roc <- roc(train$y,train.score)
coords(res.roc, "best")

##   threshold specificity sensitivity 
##   0.1103465   0.8310595   0.8787634

plot.roc(res.roc, print.auc = TRUE,print.thres = "best") #cutoff comes 0.110
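By default, coords(res.roc, "best") picks the threshold that maximises Youden's J statistic (sensitivity + specificity - 1). A rough sketch of that search on made-up toy data (actual and score are assumptions for illustration). It is a simplification: it only tries the observed scores as cutoffs, so on real data pROC's reported threshold can differ slightly.

```r
# Toy data (made up for illustration): 1 = subscribed, 0 = did not
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
score  <- c(0.10, 0.35, 0.40, 0.20, 0.80, 0.65, 0.70, 0.90)

# Try each observed score as a cutoff
candidates <- sort(unique(score))

youden <- sapply(candidates, function(cut) {
  pred <- as.numeric(score >= cut)
  sn <- sum(pred == 1 & actual == 1) / sum(actual == 1)  # sensitivity
  sp <- sum(pred == 0 & actual == 0) / sum(actual == 0)  # specificity
  sn + sp - 1
})

best_cut <- candidates[which.max(youden)]
data.frame(cutoff = candidates, J = youden)
best_cut
```

Here the cutoff 0.40 keeps all four subscribers and misclassifies only one non-subscriber, so it gets the largest J.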

Create a confusion matrix to see how good the model is (by predicting on the test_25 dataset)

#Try for cutoff 0.110
table(ActualValue=test_25$y,predictedValue=score>0.110)

##            predictedValue
## ActualValue FALSE TRUE
##           0  5877 1156
##           1   108  771

Accuracy=(771+5877)/(771+5877+108+ 1156)
Accuracy#0.8402427

## [1] 0.8402427

table(ActualValue=test_25$y,predictedValue=score>0.3)

##            predictedValue
## ActualValue FALSE TRUE
##           0  6616  417
##           1   345  534

#Accuracy=(534+6616)/(534+6616+345+417)
#Accuracy#0.9036906

table(ActualValue=test_25$y,predictedValue=score>0.5)

##            predictedValue
## ActualValue FALSE TRUE
##           0  6843  190
##           1   525  354

Accuracy=(6843+354)/(6843+354+525+190)
#Accuracy#0.9096309

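From the above we can see that accuracy rises with the cutoff (0.840 at 0.110, 0.904 at 0.3, 0.910 at 0.5) because more true negatives are captured (5877, 6616, 6843). But the number of subscribers we miss (false negatives) rises too: 108, 345 and 525, i.e. sensitivity falls from 771/879 = 0.877 to 0.608 to 0.403. Since the aim is to find likely subscribers and only about 11% of customers subscribe, the cutoff near 0.110 is the better choice even though its accuracy is lower.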

TP=771
FP=1156
FN=108
TN=5877
P=TP+FN #actual positives: 879
N=TN+FP #actual negatives: 7033

Accuracy=(TP+TN)/(P+N)
Accuracy

## [1] 0.8402427

Sn=TP/P
Sp=TN/N #specificity
KS=(TP/P)-(FP/N)
Precision=TP/(TP+FP)
Recall=TP/P
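To compare the three cutoffs side by side, the same formulas can be wrapped in a small function. This is a sketch: confusion_metrics is a hypothetical helper, not part of any package. It takes P as the actual positives (TP + FN) and N as the actual negatives (TN + FP), and the counts are copied from the three confusion tables above.

```r
# Hypothetical helper: metrics from the four confusion-matrix counts
confusion_metrics <- function(TP, FP, TN, FN) {
  P <- TP + FN   # actual positives
  N <- TN + FP   # actual negatives
  c(Accuracy  = (TP + TN) / (P + N),
    Sn        = TP / P,            # sensitivity (recall)
    Sp        = TN / N,            # specificity
    KS        = TP / P - FP / N,   # true positive rate minus false positive rate
    Precision = TP / (TP + FP))
}

# Counts taken from the tables for cutoffs 0.110, 0.3 and 0.5
results <- rbind(
  cut_0.110 = confusion_metrics(TP = 771, FP = 1156, TN = 5877, FN = 108),
  cut_0.3   = confusion_metrics(TP = 534, FP = 417,  TN = 6616, FN = 345),
  cut_0.5   = confusion_metrics(TP = 354, FP = 190,  TN = 6843, FN = 525)
)
round(results, 3)
```

Accuracy is highest at 0.5, while KS and sensitivity are highest at 0.110. KS is the evaluation criterion for this problem.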

2. Using KS Method

We’ll start by calculating probability scores on the training data and creating a base data frame with a single dummy row, to which the for loop will append its values

train.score=predict(fit_final_model,newdata = train,type="response")
real=train$y

cutoff_data=data.frame(cutoff=0,TP=0,FP=0,FN=0,TN=0)
cutoffs=seq(0,1,length=100)

We’ll go through all the cutoffs and for each we’ll store the calculated values

for (i in cutoffs){
  predicted=as.numeric(train.score>i)
  
  TP=sum(predicted==1 & train$y==1)
  FP=sum(predicted==1 & train$y==0)
  FN=sum(predicted==0 & train$y==1)
  TN=sum(predicted==0 & train$y==0)
  cutoff_data=rbind(cutoff_data,c(i,TP,FP,FN,TN))
}

## let's remove the dummy top row from the data frame cutoff_data
cutoff_data=cutoff_data[-1,]
#we now have 100 obs in df cutoff_data
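The same table can also be built without growing a data frame inside a loop and without the dummy first row. A minimal sketch, assuming toy vectors actual and score (made up here) in place of train$y and train.score. count_at is a hypothetical helper name.

```r
# Toy data (made up for illustration): 1 = subscribed, 0 = did not
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
score  <- c(0.10, 0.35, 0.40, 0.20, 0.80, 0.65, 0.70, 0.90)
cutoffs <- seq(0, 1, length.out = 100)

# Confusion counts for a single cutoff
count_at <- function(cut) {
  predicted <- as.numeric(score > cut)
  c(cutoff = cut,
    TP = sum(predicted == 1 & actual == 1),
    FP = sum(predicted == 1 & actual == 0),
    FN = sum(predicted == 0 & actual == 1),
    TN = sum(predicted == 0 & actual == 0))
}

# One row per cutoff, bound together in a single step
cutoff_toy <- as.data.frame(do.call(rbind, lapply(cutoffs, count_at)))
head(cutoff_toy, 3)
```

Binding all rows at once avoids copying the data frame on every iteration, which matters more as the number of cutoffs grows.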

Let's calculate the performance measures: sensitivity, specificity, accuracy, KS and precision.

cutoff_data=cutoff_data %>%
  mutate(P=FN+TP,N=TN+FP, #total positives and negatives
         Sn=TP/P, #sensitivity
         Sp=TN/N, #specificity
         KS=abs((TP/P)-(FP/N)),
         Accuracy=(TP+TN)/(P+N),
         Precision=TP/(TP+FP),
         Recall=TP/P
  ) %>% 
  select(-P,-N)

Let's view the cutoff dataset:

#View(cutoff_data)

Visualise how these measures (individual values) move across cutoffs

ggplot(cutoff_data,aes(x=cutoff,y=Sp))+geom_line()

ggplot(cutoff_data,aes(x=cutoff,y=Sn))+geom_line()

ggplot(cutoff_data,aes(x=cutoff,y=KS))+geom_line()

#If you want to look at all of these measures in a single plot
library(tidyr)

library(ggplot2)
cutoff_long=cutoff_data %>% 
  gather(Measure,Value,KS,Sn,Sp)

ggplot(cutoff_long,aes(x=cutoff,y=Value,color=Measure))+geom_line()

Let's find the cutoff value at which KS is maximum.

#Determine CutOff based on KS
KS_cutoff=cutoff_data$cutoff[which.max(cutoff_data$KS)]
KS_cutoff

## [1] 0.1111111

Hence 0.1111111 is the cutoff value by the KS max method.
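The KS cutoff (0.1111) nearly matches the "best" ROC threshold found earlier (0.1103). This is expected. With P actual positives and N actual negatives, the KS value at a cutoff c is the same quantity as Youden's J, which pROC maximises by default for "best":

```latex
\mathrm{KS}(c) = \left| \frac{TP(c)}{P} - \frac{FP(c)}{N} \right|
             = \left| Sn(c) - \bigl(1 - Sp(c)\bigr) \right|
             = \left| Sn(c) + Sp(c) - 1 \right|
             = \left| J(c) \right|
```

The small gap is consistent with the grid that was searched: seq(0,1,length=100) only tries multiples of 1/99, and 11/99 = 0.1111111 is the grid point closest to 0.1103.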

Step 5. Predict the final output on the test dataset (whether or not the client subscribes to a term deposit)

test.score=predict(fit_final_model,newdata =test,type = "response")#on final test dataset.

Predicting whether or not the client will subscribe, for the final test dataset.

FinalScore=as.numeric(test.score>KS_cutoff)#if score is > cutoff then true(1) else false(0)
table(FinalScore)

## FinalScore
##     0     1 
## 10173  3391

#Thus final prediction is as follows:
testFinal=factor(FinalScore,levels = c(0,1),labels=c("no","yes"))
table(testFinal)

## testFinal
##    no   yes 
## 10173  3391

write.csv(testFinal,"Neha_Raut_P5_part2.csv",row.names = F)