The primary function of bank is to accept deposits for the purpose of lending. They mobilize capital by offering interests at specific rates and lend the money to individuals, institutions based on their profile and requirements at a higher interest rate thereby making profit. Banks normally promote their products (Services) through direct marketing campaign. The data I analysed is from a Portugal based bank which did direct marketing campaign to sell a term deposit product by making phone calls. The data has been generated based on the campaign targeting the potential individuals in year 2008 to 2010.

The business objective for this project is to identify the target customers who will subscribe to the newly launched term deposit plan in the Portuguese Bank by analysing few indicators which includes the client’s data available in the bank along with few more constructs on the marketing campaign. The Objectives are broadly classified into two: 1) Determine the variables that are statistically significant and thereby influencing the purchase of term deposit 2) Determine the probability of each customer who will subscribe to the term deposit. The success criteria would be the “Accuracy with which the potential customers are identified for the term deposit subscription at an early stage without the corresponding risk of incorrectly tagging the non-subscribers as potential subscribers and spending the time and money on them”.

rm(list=ls())
getwd()
## [1] "C:/PGPMFX/Term5/BA/Project"
setwd("C:/PGPMFX/Term5/BA/Project")
bank <-read.csv("bank-full.csv", sep = ";")
library(psych)
headTail(bank)
str(bank)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

The data contains 45211 records with 16 independent variables and 1 dependent variable. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (or not) subscribed. Data is mix of Continuous and Categorical variables including demographic details. The predictor variable is probability (0 to 1) value. This dataset has no missed out values and hence no imputation is necessary. Moreover, there is no need to code as the variable types are already correct. The dependent variable - term deposit subscription alone was recoded to 0 and 1 binary instead of two factors - yes and no. The Null Hypothesis H0 is that there is no significant relationship between the term deposit subscription and the remaining 16 variables. Alternate Hypothesis, H1 is that there is significant relationship between the dependent and independent variables.

Including Plots

Bar Charts:

# Grouped Bar Plot
barplot(table(bank$y, bank$education), main="Distribution of Education vs Deposit",
        xlab="Education", col=c("darkblue","red"),
        legend = rownames(table(bank$y, bank$education)), beside=TRUE)

The above chart of Education Vs Deposit shows that clients with secondary education is high but their subscription to term deposit is not so high.

barplot(table(bank$y, bank$marital), main="Distribution of Marital Status vs Deposit",
        xlab="Marital Status", col=c("darkblue","red"),
        legend = rownames(table(bank$y, bank$marital)), beside=TRUE)

The above chart of marital status vs Deposit shows that the large number of clients are married. In absolute terms, their subscription is high. But in relative terms, the clients who are single are subscribed higher.

Pie Chart

# Pie Chart of Term Deposit Subscription
library(plotrix)
## 
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
## 
##     rescale
tab_sub <- table(bank$y)
pct <- round(tab_sub/sum(tab_sub)*100)
#lbls <- paste(lbls, pct) # add percents to labels 
lbls <- paste(pct,"%",sep="") # ad % to labels
pie3D(tab_sub,labels=lbls,explode=0.5,
      main="Pie Chart of Term Deposit Subscription ")

The above Pie chart shows that 12% of the clients are subscribed and 88% of the clients are not subscribed to term deposit. The data is imbalanced. The analysis were done with and without imbalanced data.

Histograms Histograms are created for continuous data. The outliers are truncated to understand the data better. Age seems to be normally distributed while the other continuous variables like Bank Balance, no. of contacts made for a customer in present campaign, duration of the call etc. are not normally distributed. There are outliers in these variables.

hist(bank$age, col = "green")

hist(bank$balance, col = "blue", xlim = c(0,20000))

hist(bank$campaign, col = "cyan", xlim = c(0,30))

hist(bank$day, col = "magenta")

hist(bank$duration, col = "violet", xlim = c(0,2000))

hist(bank$pdays, col = "dark green", xlim = c(0,400))

hist(bank$previous, col = "dark blue", xlim = c(0,50))

Box Plots Box plots of continuous variables are created with respect to the subscription. These give an understanding of the outliers present in the data.

boxplot(age~y, data=bank, notch=TRUE, 
        col=(c("gold","darkgreen")),
        main="Age and Subscription", xlab="Subscription")

boxplot(balance~y, data=bank, notch=TRUE, 
        col=(c("gold","blue")),
        main="Balance and Subscription", xlab="Subscription")

boxplot(campaign~y, data=bank, notch=TRUE, 
        col=(c("gold","cyan")),
        main="Campaign and Subscription", xlab="Subscription")

boxplot(day~y, data=bank, notch=FALSE, 
        col=(c("gold","magenta")),
        main="Day and Subscription", xlab="Subscription")

boxplot(duration~y, data=bank, notch=TRUE, 
        col=(c("gold","violet")),
        main="Duration and Subscription", xlab="Subscription")

boxplot(pdays~y, data=bank, notch=FALSE, 
        col=(c("gold","dark green")),
        main="Days Passed and Subscription", xlab="Subscription")

boxplot(previous~y, data=bank, notch=FALSE, 
        col=(c("gold","dark blue")),
        main="Previous Campaign and Subscription", xlab="Subscription")

Descriptive Statistics

#Descriptive Stats Summary
library(dlookr)
## 
## Attaching package: 'dlookr'
## The following object is masked from 'package:psych':
## 
##     describe
## The following object is masked from 'package:base':
## 
##     transform
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
describe(bank)

Descriptive Statistics shows the Skewness present in the dataset. To remove the outliers and to make the data to be normally distributed, further analysis carried out. Below graphs shows how the data can be transformed and which transformation will work.

plot_normality(bank, balance)
## Warning in log(x): NaNs produced
## Warning in sqrt(x): NaNs produced

plot_normality(bank, day)

plot_normality(bank, duration)

plot_normality(bank,campaign) 

plot_normality(bank, pdays)
## Warning in log(x): NaNs produced

## Warning in log(x): NaNs produced

plot_normality(bank, previous)

plot_correlate(bank)

Correlation of the continuous variables shows that none of the variables are strongly correlated with respect to the others. pdays and previous have weak correlation.

Plot Outliers To further analyse the outliers, outliers plot were created. The objective of these outliers plot is to analyse whether by “capping” the bottom 5 percentile and top 95 percentile will make the distribution to be smoother since I don’t want to make all the data to be normally distributed because of the data nature.

bank %>%
  plot_outlier(balance)

bank %>%
  plot_outlier(day)

bank %>%
  plot_outlier(duration)

bank %>%
  plot_outlier(campaign)

bank %>%
  plot_outlier(pdays)

bank %>%
  plot_outlier(previous)

Data Standardization: As mentioned above, the outliers the bottom 5 percentile and top 95 percentile data were capped for some of the continuous variables as given below. The capping resulted in reduction in outliers. The capped data are binded to the master data as separate columns.

# Imputation of Outlier through 95% upper and 5% lower Capping 

age_imp <- imputate_outlier(bank, age, method = "capping")
summary(age_imp)
## Impute outliers with capping
## 
## * Information of Imputation (before vs after)
##              Original    Imputation
## n        4.521100e+04  4.521100e+04
## na       0.000000e+00  0.000000e+00
## mean     4.093621e+01  4.074431e+01
## sd       1.061876e+01  1.010553e+01
## se_mean  4.994038e-02  4.752661e-02
## IQR      1.500000e+01  1.500000e+01
## skewness 6.848179e-01  4.039484e-01
## kurtosis 3.195704e-01 -7.281184e-01
## p00      1.800000e+01  1.800000e+01
## p01      2.300000e+01  2.300000e+01
## p05      2.700000e+01  2.700000e+01
## p10      2.900000e+01  2.900000e+01
## p20      3.200000e+01  3.200000e+01
## p25      3.300000e+01  3.300000e+01
## p30      3.400000e+01  3.400000e+01
## p40      3.600000e+01  3.600000e+01
## p50      3.900000e+01  3.900000e+01
## p60      4.200000e+01  4.200000e+01
## p70      4.600000e+01  4.600000e+01
## p75      4.800000e+01  4.800000e+01
## p80      5.100000e+01  5.100000e+01
## p90      5.600000e+01  5.600000e+01
## p95      5.900000e+01  5.900000e+01
## p99      7.100000e+01  6.300000e+01
## p100     9.500000e+01  7.000000e+01
plot(age_imp)

bank <-cbind(bank,age_imp)

balance_imp <- imputate_outlier(bank, balance, method = "capping")
summary(balance_imp)
## Impute outliers with capping
## 
## * Information of Imputation (before vs after)
##               Original   Imputation
## n         45211.000000 45211.000000
## na            0.000000     0.000000
## mean       1362.272058  1174.716949
## sd         3044.765829  1758.845268
## se_mean      14.319631     8.271906
## IQR        1356.000000  1356.000000
## skewness      8.360308     1.771347
## kurtosis    140.751547     2.019813
## p00       -8019.000000 -1944.000000
## p01        -627.000000  -616.900000
## p05        -172.000000  -172.000000
## p10           0.000000     0.000000
## p20          22.000000    22.000000
## p25          72.000000    72.000000
## p30         131.000000   131.000000
## p40         272.000000   272.000000
## p50         448.000000   448.000000
## p60         701.000000   701.000000
## p70        1126.000000  1126.000000
## p75        1428.000000  1428.000000
## p80        1859.000000  1859.000000
## p90        3574.000000  5768.000000
## p95        5768.000000  5768.000000
## p99       13164.900000  5768.000000
## p100     102127.000000  5768.000000
plot(balance_imp)

bank <-cbind(bank,balance_imp)

campaign_imp <- imputate_outlier(bank, campaign, method = "capping")
summary(campaign_imp)
## Impute outliers with capping
## 
## * Information of Imputation (before vs after)
##              Original   Imputation
## n        4.521100e+04 4.521100e+04
## na       0.000000e+00 0.000000e+00
## mean     2.763841e+00 2.527770e+00
## sd       3.098021e+00 1.947076e+00
## se_mean  1.457009e-02 9.157162e-03
## IQR      2.000000e+00 2.000000e+00
## skewness 4.898650e+00 1.588853e+00
## kurtosis 3.924965e+01 1.793558e+00
## p00      1.000000e+00 1.000000e+00
## p01      1.000000e+00 1.000000e+00
## p05      1.000000e+00 1.000000e+00
## p10      1.000000e+00 1.000000e+00
## p20      1.000000e+00 1.000000e+00
## p25      1.000000e+00 1.000000e+00
## p30      1.000000e+00 1.000000e+00
## p40      2.000000e+00 2.000000e+00
## p50      2.000000e+00 2.000000e+00
## p60      2.000000e+00 2.000000e+00
## p70      3.000000e+00 3.000000e+00
## p75      3.000000e+00 3.000000e+00
## p80      4.000000e+00 4.000000e+00
## p90      5.000000e+00 5.000000e+00
## p95      8.000000e+00 8.000000e+00
## p99      1.600000e+01 8.000000e+00
## p100     6.300000e+01 8.000000e+00
plot(campaign_imp)

bank <-cbind(bank,campaign_imp)

duration_imp <- imputate_outlier(bank, duration, method = "capping")
summary(duration_imp)
## Impute outliers with capping
## 
## * Information of Imputation (before vs after)
##              Original   Imputation
## n        45211.000000 4.521100e+04
## na           0.000000 0.000000e+00
## mean       258.163080 2.426671e+02
## sd         257.527812 1.957610e+02
## se_mean      1.211162 9.206703e-01
## IQR        216.000000 2.160000e+02
## skewness     3.144318 1.308020e+00
## kurtosis    18.153915 9.542390e-01
## p00          0.000000 0.000000e+00
## p01         11.000000 1.100000e+01
## p05         35.000000 3.500000e+01
## p10         58.000000 5.800000e+01
## p20         89.000000 8.900000e+01
## p25        103.000000 1.030000e+02
## p30        117.000000 1.170000e+02
## p40        147.000000 1.470000e+02
## p50        180.000000 1.800000e+02
## p60        223.000000 2.230000e+02
## p70        280.000000 2.800000e+02
## p75        319.000000 3.190000e+02
## p80        368.000000 3.680000e+02
## p90        548.000000 5.480000e+02
## p95        751.000000 7.510000e+02
## p99       1269.000000 7.510000e+02
## p100      4918.000000 7.510000e+02
plot(duration_imp)

bank <-cbind(bank,duration_imp)

pdays_imp <- imputate_outlier(bank, pdays, method = "capping")
summary(pdays_imp)
## Impute outliers with capping
## 
## * Information of Imputation (before vs after)
##               Original    Imputation
## n        45211.0000000 45211.0000000
## na           0.0000000     0.0000000
## mean        40.1978280    57.0771494
## sd         100.1287460   122.8654228
## se_mean      0.4709087     0.5778400
## IQR          0.0000000     0.0000000
## skewness     2.6157155     1.6428921
## kurtosis     6.9351952     0.6991253
## p00         -1.0000000    -1.0000000
## p01         -1.0000000    -1.0000000
## p05         -1.0000000    -1.0000000
## p10         -1.0000000    -1.0000000
## p20         -1.0000000    -1.0000000
## p25         -1.0000000    -1.0000000
## p30         -1.0000000    -1.0000000
## p40         -1.0000000    -1.0000000
## p50         -1.0000000    -1.0000000
## p60         -1.0000000    -1.0000000
## p70         -1.0000000    -1.0000000
## p75         -1.0000000    -1.0000000
## p80         -1.0000000    -1.0000000
## p90        185.0000000   317.0000000
## p95        317.0000000   317.0000000
## p99        370.0000000   317.0000000
## p100       871.0000000   317.0000000
plot(pdays_imp)

bank <-cbind(bank,pdays_imp)

previous_imp <- imputate_outlier(bank, previous, method = "capping")
summary(previous_imp)
## Impute outliers with capping
## 
## * Information of Imputation (before vs after)
##              Original   Imputation
## n        4.521100e+04 4.521100e+04
## na       0.000000e+00 0.000000e+00
## mean     5.803234e-01 5.478976e-01
## sd       2.303441e+00 1.159108e+00
## se_mean  1.083316e-02 5.451321e-03
## IQR      0.000000e+00 0.000000e+00
## skewness 4.184645e+01 1.642892e+00
## kurtosis 4.506861e+03 6.991253e-01
## p00      0.000000e+00 0.000000e+00
## p01      0.000000e+00 0.000000e+00
## p05      0.000000e+00 0.000000e+00
## p10      0.000000e+00 0.000000e+00
## p20      0.000000e+00 0.000000e+00
## p25      0.000000e+00 0.000000e+00
## p30      0.000000e+00 0.000000e+00
## p40      0.000000e+00 0.000000e+00
## p50      0.000000e+00 0.000000e+00
## p60      0.000000e+00 0.000000e+00
## p70      0.000000e+00 0.000000e+00
## p75      0.000000e+00 0.000000e+00
## p80      0.000000e+00 0.000000e+00
## p90      2.000000e+00 3.000000e+00
## p95      3.000000e+00 3.000000e+00
## p99      8.900000e+00 3.000000e+00
## p100     2.750000e+02 3.000000e+00
plot(previous_imp)

bank <-cbind(bank,previous_imp)

Box plots for Outliers minimized data: Box plots are again created with the capped data to see the distribution. Compared to the original data, these capped data distribution is good without affecting its distribution much.

#Box Plots - Imputed & Outliers minimized variables
boxplot(age_imp~y, data=bank, notch=TRUE, 
        col=(c("gold","darkgreen")),
        main="Age and Subscription", xlab="Subscription")

boxplot(balance_imp~y, data=bank, notch=TRUE, 
        col=(c("gold","blue")),
        main="Balance and Subscription", xlab="Subscription")

boxplot(campaign_imp~y, data=bank, notch=TRUE, 
        col=(c("gold","cyan")),
        main="Campaign and Subscription", xlab="Subscription")

boxplot(duration_imp~y, data=bank, notch=TRUE, 
        col=(c("gold","violet")),
        main="Duration and Subscription", xlab="Subscription")

boxplot(pdays_imp~y, data=bank, notch=FALSE, 
        col=(c("gold","dark green")),
        main="Days Passed and Subscription", xlab="Subscription")

boxplot(previous_imp~y, data=bank, notch=FALSE, 
        col=(c("gold","dark blue")),
        main="Previous Campaign and Subscription", xlab="Subscription")

Category reduction in variables - Analysis by percentage and absolute numbers There are categorical variables which has more levels like Job, marital status etc. These needs to be analysed and grouped together based on the similarity of the percentage of events occurring when compared to the total of events and non-events.

table(bank$job, bank$y)
##                
##                   no  yes
##   admin.        4540  631
##   blue-collar   9024  708
##   entrepreneur  1364  123
##   housemaid     1131  109
##   management    8157 1301
##   retired       1748  516
##   self-employed 1392  187
##   services      3785  369
##   student        669  269
##   technician    6757  840
##   unemployed    1101  202
##   unknown        254   34
round(prop.table(table(bank$job, bank$y),1),2)
##                
##                   no  yes
##   admin.        0.88 0.12
##   blue-collar   0.93 0.07
##   entrepreneur  0.92 0.08
##   housemaid     0.91 0.09
##   management    0.86 0.14
##   retired       0.77 0.23
##   self-employed 0.88 0.12
##   services      0.91 0.09
##   student       0.71 0.29
##   technician    0.89 0.11
##   unemployed    0.84 0.16
##   unknown       0.88 0.12
table(bank$education, bank$y)
##            
##                no   yes
##   primary    6260   591
##   secondary 20752  2450
##   tertiary  11305  1996
##   unknown    1605   252
round(prop.table(table(bank$education, bank$y),1),2)
##            
##               no  yes
##   primary   0.91 0.09
##   secondary 0.89 0.11
##   tertiary  0.85 0.15
##   unknown   0.86 0.14
table(bank$month, bank$y)
##      
##          no   yes
##   apr  2355   577
##   aug  5559   688
##   dec   114   100
##   feb  2208   441
##   jan  1261   142
##   jul  6268   627
##   jun  4795   546
##   mar   229   248
##   may 12841   925
##   nov  3567   403
##   oct   415   323
##   sep   310   269
round(prop.table(table(bank$month, bank$y),1),2)
##      
##         no  yes
##   apr 0.80 0.20
##   aug 0.89 0.11
##   dec 0.53 0.47
##   feb 0.83 0.17
##   jan 0.90 0.10
##   jul 0.91 0.09
##   jun 0.90 0.10
##   mar 0.48 0.52
##   may 0.93 0.07
##   nov 0.90 0.10
##   oct 0.56 0.44
##   sep 0.54 0.46
table(bank$marital, bank$y)
##           
##               no   yes
##   divorced  4585   622
##   married  24459  2755
##   single   10878  1912
round(prop.table(table(bank$marital, bank$y),1),2)
##           
##              no  yes
##   divorced 0.88 0.12
##   married  0.90 0.10
##   single   0.85 0.15
table(bank$poutcome, bank$y)
##          
##              no   yes
##   failure  4283   618
##   other    1533   307
##   success   533   978
##   unknown 33573  3386
round(prop.table(table(bank$poutcome, bank$y),1),2)
##          
##             no  yes
##   failure 0.87 0.13
##   other   0.83 0.17
##   success 0.35 0.65
##   unknown 0.91 0.09
table(bank$contact, bank$y)
##            
##                no   yes
##   cellular  24916  4369
##   telephone  2516   390
##   unknown   12490   530
round(prop.table(table(bank$contact, bank$y),1),2)
##            
##               no  yes
##   cellular  0.85 0.15
##   telephone 0.87 0.13
##   unknown   0.96 0.04
#Category reduction in variables by similar percentage of events using Rockchalk library

library(rockchalk)
## 
## Attaching package: 'rockchalk'
## The following object is masked from 'package:dplyr':
## 
##     summarize
#Combine Job levels to 4
job_level1 <-combineLevels(bank$job,c("student","retired"),newLabel=c("st_re"))
job_level2 <-combineLevels(job_level1,c("unemployed","management"),newLabel=c("une_mgmt"))
job_level3 <-combineLevels(job_level2,c("blue-collar","entrepreneur","housemaid",
                                     "services"),newLabel=c("bl_en_hm_se"))
job_level <-combineLevels(job_level3,c("admin.","self-employed","technician",
                                        "unknown"),newLabel=c("ad_sel_te_unk"))
table(job_level)
bank <-cbind(bank,job_level)
#Combine month levels to 3 
month_level1 <-combineLevels(bank$month,c("aug","jan","jun","jul","may","nov"),newLabel=c("Ja_Ma_Ju_Jul_Au_No"))
month_level2 <-combineLevels(month_level1,c("mar","sep","oct","dec"),newLabel=c("Mar_Se_Oc_De"))
month_level <-combineLevels(month_level2,c("apr","feb"),newLabel=c("Fe_Ap"))
table(month_level)
bank <-cbind(bank,month_level)
#Combine education levels to 2
edu_level1 <-combineLevels(bank$education,c("primary","secondary"),newLabel=c("Pri_Sec"))
edu_level <-combineLevels(edu_level1,c("tertiary","unknown"),newLabel=c("ter_unk"))
table(edu_level)
bank <-cbind(bank,edu_level)
#Combine marital levels to 2
marital_level <-combineLevels(bank$marital,c("single","divorced"),newLabel=c("Si_Di"))
table(marital_level)
bank <-cbind(bank,marital_level)
#Combine poutcome levels to 2
poutcome_level <-combineLevels(bank$poutcome,c("failure","other","unknown"),newLabel=c("rest"))
table(poutcome_level)
bank <-cbind(bank,poutcome_level)
#Correlation of Continuous Variables - Before Outlier reduction
cor(bank[,c(6,12,13,14,15)])
##               balance     duration    campaign        pdays     previous
## balance   1.000000000  0.021560380 -0.01457828  0.003435322  0.016673637
## duration  0.021560380  1.000000000 -0.08456950 -0.001564770  0.001203057
## campaign -0.014578279 -0.084569503  1.00000000 -0.088627668 -0.032855290
## pdays     0.003435322 -0.001564770 -0.08862767  1.000000000  0.454819635
## previous  0.016673637  0.001203057 -0.03285529  0.454819635  1.000000000
#Correlation of Continuous Variables - After Outlier reduction
cor(bank[,c(19,20,21,22,23)])
##              balance_imp campaign_imp duration_imp   pdays_imp
## balance_imp   1.00000000  -0.02101371   0.03844600  0.05067016
## campaign_imp -0.02101371   1.00000000  -0.09887499 -0.11572108
## duration_imp  0.03844600  -0.09887499   1.00000000  0.01319840
## pdays_imp     0.05067016  -0.11572108   0.01319840  1.00000000
## previous_imp  0.05067016  -0.11572108   0.01319840  1.00000000
##              previous_imp
## balance_imp    0.05067016
## campaign_imp  -0.11572108
## duration_imp   0.01319840
## pdays_imp      1.00000000
## previous_imp   1.00000000

Correlation of Continuous variables after the outlier reduction shows that pdays and previous are correlated and hence one can be removed. Here I removed pdays.

Data Partition: Data has been separated to Training and Test dataset with probability of 0.5 (exactly split into two) based on the dependent variable.

#Data Classification to Train and Test
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
set.seed(1234)
train <- createDataPartition(bank$y, list=FALSE) #create 2 set of data by Random sampling
train

bank.train <- bank[train, ]
bank.test <- bank[-train, ] #-train is complement to train data
#Convert Yes No to 1 and 0 respectively
bank.train$ObsY <- ifelse (bank.train$y == "yes", 1,0)
bank.test$ObsY <- ifelse (bank.test$y == "yes", 1,0)
bank.test$ObsY <- as.integer(bank.test$ObsY)
bank.train$ObsY <- as.integer(bank.train$ObsY)
headTail(bank.train)
headTail(bank.test)
summary(bank.train)
##       age                 job           marital          education    
##  Min.   :18.00   blue-collar:4858   divorced: 2633   primary  : 3408  
##  1st Qu.:33.00   management :4719   married :13568   secondary:11625  
##  Median :39.00   technician :3795   single  : 6405   tertiary : 6686  
##  Mean   :40.95   admin.     :2580                    unknown  :  887  
##  3rd Qu.:48.00   services   :2128                                     
##  Max.   :95.00   retired    :1107                                     
##                  (Other)    :3419                                     
##  default        balance       housing      loan            contact     
##  no :22210   Min.   : -6847   no :10041   no :19013   cellular :14616  
##  yes:  396   1st Qu.:    76   yes:12565   yes: 3593   telephone: 1466  
##              Median :   453                           unknown  : 6524  
##              Mean   :  1360                                            
##              3rd Qu.:  1430                                            
##              Max.   :102127                                            
##                                                                        
##       day            month         duration         campaign     
##  Min.   : 1.00   may    :6857   Min.   :   0.0   Min.   : 1.000  
##  1st Qu.: 8.00   jul    :3470   1st Qu.: 103.0   1st Qu.: 1.000  
##  Median :16.00   aug    :3098   Median : 180.0   Median : 2.000  
##  Mean   :15.84   jun    :2740   Mean   : 257.2   Mean   : 2.789  
##  3rd Qu.:21.00   nov    :1953   3rd Qu.: 318.0   3rd Qu.: 3.000  
##  Max.   :31.00   apr    :1412   Max.   :3881.0   Max.   :63.000  
##                  (Other):3076                                    
##      pdays           previous          poutcome       y        
##  Min.   : -1.00   Min.   : 0.0000   failure: 2432   no :19961  
##  1st Qu.: -1.00   1st Qu.: 0.0000   other  :  909   yes: 2645  
##  Median : -1.00   Median : 0.0000   success:  784              
##  Mean   : 39.97   Mean   : 0.5834   unknown:18481              
##  3rd Qu.: -1.00   3rd Qu.: 0.0000                              
##  Max.   :871.00   Max.   :58.0000                              
##                                                                
##     age_imp       balance_imp     campaign_imp    duration_imp  
##  Min.   :18.00   Min.   :-1941   Min.   :1.000   Min.   :  0.0  
##  1st Qu.:33.00   1st Qu.:   76   1st Qu.:1.000   1st Qu.:103.0  
##  Median :39.00   Median :  453   Median :2.000   Median :180.0  
##  Mean   :40.75   Mean   : 1172   Mean   :2.543   Mean   :241.9  
##  3rd Qu.:48.00   3rd Qu.: 1430   3rd Qu.:3.000   3rd Qu.:318.0  
##  Max.   :70.00   Max.   : 5768   Max.   :8.000   Max.   :751.0  
##                                                                 
##    pdays_imp      previous_imp            job_level   
##  Min.   : -1.0   Min.   :0.0000   st_re        :1588  
##  1st Qu.: -1.0   1st Qu.:0.0000   une_mgmt     :5385  
##  Median : -1.0   Median :0.0000   bl_en_hm_se  :8333  
##  Mean   : 57.1   Mean   :0.5481   ad_sel_te_unk:7300  
##  3rd Qu.: -1.0   3rd Qu.:0.0000                       
##  Max.   :317.0   Max.   :3.0000                       
##                                                       
##              month_level      edu_level     marital_level  
##  Ja_Ma_Ju_Jul_Au_No:18846   Pri_Sec:15033   married:13568  
##  Mar_Se_Oc_De      : 1029   ter_unk: 7573   Si_Di  : 9038  
##  Fe_Ap             : 2731                                  
##                                                            
##                                                            
##                                                            
##                                                            
##  poutcome_level       ObsY      
##  success:  784   Min.   :0.000  
##  rest   :21822   1st Qu.:0.000  
##                  Median :0.000  
##                  Mean   :0.117  
##                  3rd Qu.:0.000  
##                  Max.   :1.000  
## 
summary(bank.test)
##       age                 job           marital          education    
##  Min.   :18.00   blue-collar:4874   divorced: 2574   primary  : 3443  
##  1st Qu.:33.00   management :4739   married :13646   secondary:11577  
##  Median :39.00   technician :3802   single  : 6385   tertiary : 6615  
##  Mean   :40.93   admin.     :2591                    unknown  :  970  
##  3rd Qu.:48.00   services   :2026                                     
##  Max.   :95.00   retired    :1157                                     
##                  (Other)    :3416                                     
##  default        balance      housing      loan            contact     
##  no :22186   Min.   :-8019   no :10040   no :18954   cellular :14669  
##  yes:  419   1st Qu.:   69   yes:12565   yes: 3651   telephone: 1440  
##              Median :  444                           unknown  : 6496  
##              Mean   : 1365                                            
##              3rd Qu.: 1423                                            
##              Max.   :98417                                            
##                                                                       
##       day            month         duration         campaign     
##  Min.   : 1.00   may    :6909   Min.   :   0.0   Min.   : 1.000  
##  1st Qu.: 8.00   jul    :3425   1st Qu.: 103.0   1st Qu.: 1.000  
##  Median :16.00   aug    :3149   Median : 180.0   Median : 2.000  
##  Mean   :15.77   jun    :2601   Mean   : 259.2   Mean   : 2.739  
##  3rd Qu.:21.00   nov    :2017   3rd Qu.: 319.0   3rd Qu.: 3.000  
##  Max.   :31.00   apr    :1520   Max.   :4918.0   Max.   :58.000  
##                  (Other):2984                                    
##      pdays           previous           poutcome       y        
##  Min.   : -1.00   Min.   :  0.0000   failure: 2469   no :19961  
##  1st Qu.: -1.00   1st Qu.:  0.0000   other  :  931   yes: 2644  
##  Median : -1.00   Median :  0.0000   success:  727              
##  Mean   : 40.42   Mean   :  0.5772   unknown:18478              
##  3rd Qu.: -1.00   3rd Qu.:  0.0000                              
##  Max.   :842.00   Max.   :275.0000                              
##                                                                 
##     age_imp       balance_imp     campaign_imp    duration_imp  
##  Min.   :18.00   Min.   :-1944   Min.   :1.000   Min.   :  0.0  
##  1st Qu.:33.00   1st Qu.:   69   1st Qu.:1.000   1st Qu.:103.0  
##  Median :39.00   Median :  444   Median :2.000   Median :180.0  
##  Mean   :40.74   Mean   : 1178   Mean   :2.512   Mean   :243.4  
##  3rd Qu.:48.00   3rd Qu.: 1423   3rd Qu.:3.000   3rd Qu.:319.0  
##  Max.   :70.00   Max.   : 5768   Max.   :8.000   Max.   :751.0  
##                                                                 
##    pdays_imp       previous_imp            job_level   
##  Min.   : -1.00   Min.   :0.0000   st_re        :1614  
##  1st Qu.: -1.00   1st Qu.:0.0000   une_mgmt     :5376  
##  Median : -1.00   Median :0.0000   bl_en_hm_se  :8280  
##  Mean   : 57.06   Mean   :0.5477   ad_sel_te_unk:7335  
##  3rd Qu.: -1.00   3rd Qu.:0.0000                       
##  Max.   :317.00   Max.   :3.0000                       
##                                                        
##              month_level      edu_level     marital_level  
##  Ja_Ma_Ju_Jul_Au_No:18776   Pri_Sec:15020   married:13646  
##  Mar_Se_Oc_De      :  979   ter_unk: 7585   Si_Di  : 8959  
##  Fe_Ap             : 2850                                  
##                                                            
##                                                            
##                                                            
##                                                            
##  poutcome_level       ObsY      
##  success:  727   Min.   :0.000  
##  rest   :21878   1st Qu.:0.000  
##                  Median :0.000  
##                  Mean   :0.117  
##                  3rd Qu.:0.000  
##                  Max.   :1.000  
## 

Logistic regression with original data (unstandardized) is performed to analyse the significant variable. For every n level of categorical variable, R automatically creates n-1 dummy variable.

#Logistic Regression - Original Data
bank.glm <- glm(y ~ age + job + marital + education + default
                + balance + housing + loan + contact + day
                + month + duration + campaign + pdays + previous
                + poutcome, data = bank.train, family = "binomial")
summary(bank.glm)
## 
## Call:
## glm(formula = y ~ age + job + marital + education + default + 
##     balance + housing + loan + contact + day + month + duration + 
##     campaign + pdays + previous + poutcome, family = "binomial", 
##     data = bank.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.9144  -0.3752  -0.2510  -0.1485   3.5216  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.318e+00  2.604e-01  -8.900  < 2e-16 ***
## age                -4.333e-03  3.137e-03  -1.381  0.16720    
## jobblue-collar     -4.108e-01  1.050e-01  -3.911 9.20e-05 ***
## jobentrepreneur    -3.672e-01  1.818e-01  -2.020  0.04339 *  
## jobhousemaid       -5.240e-01  1.894e-01  -2.766  0.00567 ** 
## jobmanagement      -1.471e-01  1.039e-01  -1.416  0.15672    
## jobretired          2.675e-01  1.411e-01   1.897  0.05789 .  
## jobself-employed   -7.208e-02  1.549e-01  -0.465  0.64176    
## jobservices        -6.474e-02  1.147e-01  -0.565  0.57240    
## jobstudent          4.427e-01  1.527e-01   2.899  0.00374 ** 
## jobtechnician      -1.139e-01  9.802e-02  -1.162  0.24541    
## jobunemployed      -1.146e-01  1.527e-01  -0.750  0.45309    
## jobunknown         -6.853e-01  3.733e-01  -1.836  0.06639 .  
## maritalmarried     -1.551e-01  8.293e-02  -1.870  0.06153 .  
## maritalsingle      -3.267e-02  9.548e-02  -0.342  0.73223    
## educationsecondary  8.450e-02  9.091e-02   0.930  0.35260    
## educationtertiary   2.601e-01  1.053e-01   2.470  0.01352 *  
## educationunknown    2.087e-02  1.531e-01   0.136  0.89153    
## defaultyes         -1.249e-01  2.440e-01  -0.512  0.60874    
## balance             1.745e-05  7.044e-06   2.477  0.01326 *  
## housingyes         -6.944e-01  6.203e-02 -11.195  < 2e-16 ***
## loanyes            -4.447e-01  8.537e-02  -5.209 1.90e-07 ***
## contacttelephone   -2.000e-01  1.070e-01  -1.870  0.06145 .  
## contactunknown     -1.668e+00  1.039e-01 -16.050  < 2e-16 ***
## day                 1.519e-02  3.509e-03   4.330 1.49e-05 ***
## monthaug           -8.444e-01  1.118e-01  -7.550 4.36e-14 ***
## monthdec            4.133e-01  2.491e-01   1.659  0.09715 .  
## monthfeb           -7.013e-02  1.251e-01  -0.561  0.57492    
## monthjan           -1.450e+00  1.706e-01  -8.504  < 2e-16 ***
## monthjul           -9.436e-01  1.087e-01  -8.678  < 2e-16 ***
## monthjun            3.270e-01  1.321e-01   2.475  0.01333 *  
## monthmar            1.446e+00  1.686e-01   8.577  < 2e-16 ***
## monthmay           -4.499e-01  1.017e-01  -4.425 9.65e-06 ***
## monthnov           -1.064e+00  1.207e-01  -8.811  < 2e-16 ***
## monthoct            7.579e-01  1.520e-01   4.986 6.15e-07 ***
## monthsep            7.523e-01  1.667e-01   4.512 6.42e-06 ***
## duration            4.155e-03  9.162e-05  45.352  < 2e-16 ***
## campaign           -1.041e-01  1.485e-02  -7.011 2.36e-12 ***
## pdays              -2.230e-04  4.366e-04  -0.511  0.60954    
## previous            3.304e-02  1.306e-02   2.530  0.01139 *  
## poutcomeother       2.277e-01  1.283e-01   1.774  0.07609 .  
## poutcomesuccess     2.370e+00  1.159e-01  20.445  < 2e-16 ***
## poutcomeunknown     6.147e-02  1.363e-01   0.451  0.65189    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16318  on 22605  degrees of freedom
## Residual deviance: 10794  on 22563  degrees of freedom
## AIC: 10880
## 
## Number of Fisher Scoring iterations: 6
confint(bank.glm)
## Waiting for profiling to be done...
##                            2.5 %        97.5 %
## (Intercept)        -2.829593e+00 -1.808570e+00
## age                -1.049154e-02  1.805619e-03
## jobblue-collar     -6.166407e-01 -2.047687e-01
## jobentrepreneur    -7.314016e-01 -1.803935e-02
## jobhousemaid       -9.033648e-01 -1.600604e-01
## jobmanagement      -3.501340e-01  5.719013e-02
## jobretired         -9.623689e-03  5.434802e-01
## jobself-employed   -3.796936e-01  2.280504e-01
## jobservices        -2.903481e-01  1.593643e-01
## jobstudent          1.418719e-01  7.406701e-01
## jobtechnician      -3.055539e-01  7.877701e-02
## jobunemployed      -4.172646e-01  1.816343e-01
## jobunknown         -1.464055e+00  6.935223e-03
## maritalmarried     -3.162833e-01  8.892838e-03
## maritalsingle      -2.189260e-01  1.554259e-01
## educationsecondary -9.250782e-02  2.639275e-01
## educationtertiary   5.453223e-02  4.673857e-01
## educationunknown   -2.823900e-01  3.179040e-01
## defaultyes         -6.305556e-01  3.295936e-01
## balance             3.502218e-06  3.123946e-05
## housingyes         -8.163315e-01 -5.731356e-01
## loanyes            -6.141616e-01 -2.793998e-01
## contacttelephone   -4.124924e-01  6.925721e-03
## contactunknown     -1.873302e+00 -1.465797e+00
## day                 8.318117e-03  2.207532e-02
## monthaug           -1.063598e+00 -6.250671e-01
## monthdec           -7.841469e-02  8.994774e-01
## monthfeb           -3.156536e-01  1.746693e-01
## monthjan           -1.790998e+00 -1.121816e+00
## monthjul           -1.156907e+00 -7.305166e-01
## monthjun            6.799700e-02  5.860070e-01
## monthmar            1.115709e+00  1.777115e+00
## monthmay           -6.487159e-01 -2.500866e-01
## monthnov           -1.301809e+00 -8.283232e-01
## monthoct            4.591538e-01  1.055172e+00
## monthsep            4.249188e-01  1.078787e+00
## duration            3.976978e-03  4.336163e-03
## campaign           -1.338818e-01 -7.567839e-02
## pdays              -1.083564e-03  6.284931e-04
## previous            6.470268e-03  5.793068e-02
## poutcomeother      -2.573233e-02  4.776252e-01
## poutcomesuccess     2.144159e+00  2.598659e+00
## poutcomeunknown    -2.048900e-01  3.293991e-01

Logistic Regression shows, Job: Blue-collar, entrepreneur, housemaid, student; Education: Tertiary; Balance; housing loan; personal loan; contact: unknown; days; months: all except dec & feb; duration; campaign; previous; poutcome: success along with their reference variables are statistically significant in influencing the outcome of term deposit subscription. Their confidence intervals are not overlapping with 0 as well. The Estimate (Beta) shows the odds.

Logistic Regression with Standardized data (Outlier reduction and level standardization for categorical variables) is performed to analyse the influence of variables on the outcome.

#Logistic Regression - Standarized Data. poutcome removed due to its correlation with previous campaign
bank.glm1 <- glm(y ~ age_imp + job_level + marital_level + edu_level + default
                + balance_imp + housing + loan + contact + day
                + month_level + duration_imp + campaign_imp + previous_imp
                + poutcome_level, data = bank.train, family = "binomial")
summary(bank.glm1)
## 
## Call:
## glm(formula = y ~ age_imp + job_level + marital_level + edu_level + 
##     default + balance_imp + housing + loan + contact + day + 
##     month_level + duration_imp + campaign_imp + previous_imp + 
##     poutcome_level, family = "binomial", data = bank.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2620  -0.3615  -0.2260  -0.1393   3.1544  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -5.320e-01  2.038e-01  -2.610  0.00905 ** 
## age_imp                 -1.142e-02  2.668e-03  -4.280 1.87e-05 ***
## job_levelune_mgmt       -4.804e-01  9.946e-02  -4.830 1.37e-06 ***
## job_levelbl_en_hm_se    -6.643e-01  9.745e-02  -6.816 9.35e-12 ***
## job_levelad_sel_te_unk  -4.284e-01  9.329e-02  -4.592 4.40e-06 ***
## marital_levelSi_Di       1.646e-01  5.540e-02   2.971  0.00297 ** 
## edu_levelter_unk         1.439e-01  6.164e-02   2.335  0.01953 *  
## defaultyes              -2.255e-01  2.504e-01  -0.901  0.36776    
## balance_imp              8.501e-05  1.364e-05   6.230 4.67e-10 ***
## housingyes              -6.982e-01  5.604e-02 -12.458  < 2e-16 ***
## loanyes                 -4.345e-01  8.254e-02  -5.264 1.41e-07 ***
## contacttelephone        -1.988e-01  1.057e-01  -1.880  0.06007 .  
## contactunknown          -9.439e-01  8.037e-02 -11.744  < 2e-16 ***
## day                      1.692e-03  3.091e-03   0.547  0.58412    
## month_levelMar_Se_Oc_De  1.688e+00  8.890e-02  18.987  < 2e-16 ***
## month_levelFe_Ap         6.635e-01  6.969e-02   9.521  < 2e-16 ***
## duration_imp             5.879e-03  1.171e-04  50.198  < 2e-16 ***
## campaign_imp            -9.764e-02  1.631e-02  -5.985 2.17e-09 ***
## previous_imp             4.816e-02  2.348e-02   2.051  0.04028 *  
## poutcome_levelrest      -2.386e+00  1.066e-01 -22.379  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16318  on 22605  degrees of freedom
## Residual deviance: 10646  on 22586  degrees of freedom
## AIC: 10686
## 
## Number of Fisher Scoring iterations: 6
confint(bank.glm1)
## Waiting for profiling to be done...
##                                 2.5 %        97.5 %
## (Intercept)             -0.9315878401 -0.1324864051
## age_imp                 -0.0166546744 -0.0061967706
## job_levelune_mgmt       -0.6748022469 -0.2848617685
## job_levelbl_en_hm_se    -0.8547156219 -0.4726372072
## job_levelad_sel_te_unk  -0.6105211379 -0.2447579610
## marital_levelSi_Di       0.0559515719  0.2731330278
## edu_levelter_unk         0.0229260732  0.2645773111
## defaultyes              -0.7420922272  0.2429016000
## balance_imp              0.0000581443  0.0001116388
## housingyes              -0.8083197482 -0.5886022343
## loanyes                 -0.5981490412 -0.2744990594
## contacttelephone        -0.4085118204  0.0061072290
## contactunknown          -1.1030303157 -0.7878752115
## day                     -0.0043688521  0.0077492129
## month_levelMar_Se_Oc_De  1.5134700862  1.8619940750
## month_levelFe_Ap         0.5265487244  0.7997844527
## duration_imp             0.0056508636  0.0061100178
## campaign_imp            -0.1299331340 -0.0659662839
## previous_imp             0.0019044272  0.0939686072
## poutcome_levelrest      -2.5961609746 -2.1781275004

The analysis shows that age, job, education, balance, housing loan, personal loan, contact: unknown & cellular, months, duration, campaign, previous, poutcome became statistically significant variables in influencing the term deposit subscription.

Data Prediction has been carried out with respect to the original data and standardized dataset.

#Prediction - Original data
bank.test$predSub <- predict.glm(bank.glm, newdata = bank.test, type = "response")
headTail(bank.test)
#Prediction - Standardized data
bank.test$predSub1 <- predict.glm(bank.glm1, newdata = bank.test, type = "response")
headTail(bank.test)

INFERENCES FROM LOGISTIC REGRESSION OF IMBALANCED DATA

#Confusion Matrix - Original data
library(SDMTools)
## 
## Attaching package: 'SDMTools'
## The following objects are masked from 'package:caret':
## 
##     sensitivity, specificity
confusion.matrix(bank.test$ObsY, bank.test$predSub, threshold = 0.5)
##     obs
## pred     0    1
##    0 19491 1743
##    1   470  901
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, bank.test$predSub, threshold = 0.5)
#Confusion Matrix - Standarized data
library(SDMTools)
confusion.matrix(bank.test$ObsY, bank.test$predSub1, threshold = 0.5)
##     obs
## pred     0    1
##    0 19396 1694
##    1   565  950
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, bank.test$predSub1, threshold = 0.5)

Confusion Matrix of Standardized data has better prediction of True Positive (including Sensitivity) while the Original data has better prediction of True Negative (Specificity). The accuracy of the model is better for Standardized dataset (66.55%) when compared to the original dataset (65.86%). Cohen’s Kappa which is the classification accuracy normalized by the imbalance of the classes in the data is little better for the Standardized data (0.406) when compared to the original data (0.401). Landis and Koch (1977) provide a way to characterize values. According to their scheme a value < 0 is indicating no agreement, 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect agreement

#ROC Curve - Original data
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following object is masked from 'package:SDMTools':
## 
##     auc
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
myROC <- roc(bank.test$y, bank.test$predSub)
myROC
## 
## Call:
## roc.default(response = bank.test$y, predictor = bank.test$predSub)
## 
## Data: bank.test$predSub in 19961 controls (bank.test$y no) < 2644 cases (bank.test$y yes).
## Area under the curve: 0.9059
plot(myROC)

#ROC Curve - Standarized data
library(pROC)
myROC1 <- roc(bank.test$y, bank.test$predSub1)
myROC1
## 
## Call:
## roc.default(response = bank.test$y, predictor = bank.test$predSub1)
## 
## Data: bank.test$predSub1 in 19961 controls (bank.test$y no) < 2644 cases (bank.test$y yes).
## Area under the curve: 0.9086
plot(myROC1)

AUC of Standardized data is better at 90.86% when compared to the original data 90.59%

Optimum Threshold: To find the optimal threshold, I created a plot from the ROC attributes - Sensitivity, Specificity and Threshold. The point where both Sensitivity and Specificity are intersecting, the corresponding threshold is the optimum.

# look at TPR (Sensitivity) and TNR (Specificity) distribution over threshold for Original data
matplot(data.frame(myROC$sensitivities, myROC$specificities), x = myROC$thresholds, type='l', xlab = 'threshold', ylab='TPR, TNR')
legend('bottomright', legend=c('TPR', 'TNR'), lty=1:2, col=1:2)

# Confusion matrix for Original data with Optimal threshold of 0.1
confusion.matrix(bank.test$ObsY, bank.test$predSub, threshold = 0.1)
##     obs
## pred     0    1
##    0 16184  372
##    1  3777 2272
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, bank.test$predSub,0.10)
# look at TPR (Sensitivity) and TNR (Specificity) distribution over threshold for Standardized data
matplot(data.frame(myROC1$sensitivities, myROC1$specificities), x = myROC1$thresholds, type='l', xlab = 'threshold', ylab='TPR, TNR')
legend('bottomright', legend=c('TPR', 'TNR'), lty=1:2, col=1:2)

# Confusion matrix for Standardized data with Optimal threshold of 0.1
confusion.matrix(bank.test$ObsY, bank.test$predSub1, threshold = 0.1)
##     obs
## pred     0    1
##    0 16127  374
##    1  3834 2270
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, bank.test$predSub1,0.1)

With both Standardized and Original datasets, the optimum threshold is 0.1 which means that if the probability is >0.1, we should conclude that the subscriber will subscribe to the term deposit. Sensitivity, Specificity and Accuracy are greater than 80%.

DATA BALANCING: As the dataset is imbalanced, we need to perform data balancing to improve the prediction. There are many methods available like Under Sampling, Over Sampling, SMOTE etc. For this model, I used Random Over Sampling method (ROSE) with Standardized data.

Iteration with Balancing through ROSE method for Standardized data

#Random Over Sampling (ROSE) for Original data and perform Logistic Regression

library(ROSE)
## Loaded ROSE 0.0-3
#ROSE
data_rose <- ROSE(y ~ age_imp + job_level + marital_level + edu_level + default
                + balance_imp + housing + loan + contact + day
                + month_level + duration_imp + campaign_imp + pdays_imp + previous_imp
                + poutcome_level, data = bank.train, seed = 1)$data
table(data_rose$y)
## 
##    no   yes 
## 11372 11234

As we can see above, ROSE made the minority positive outcome to be equal to majority negative outcome in the train data. This can be used to perform logistic regression and predict the test data.

LOGISTIC REGRESSION OF BALANCED STANDARDIZED DATA: Logistic regression with Standardized and balanced data is performed as given below

#build logistic model Balanced data
glm.rose <- glm(y ~ age_imp + job_level + marital_level + edu_level + default
                + balance_imp + housing + loan + contact + day
                + month_level + duration_imp + campaign_imp + pdays_imp + previous_imp
                + poutcome_level,data = data_rose, family = "binomial")
summary(glm.rose)
## 
## Call:
## glm(formula = y ~ age_imp + job_level + marital_level + edu_level + 
##     default + balance_imp + housing + loan + contact + day + 
##     month_level + duration_imp + campaign_imp + pdays_imp + previous_imp + 
##     poutcome_level, family = "binomial", data = data_rose)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8615  -0.6247  -0.1497   0.6465   3.0901  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              1.675e+00  1.528e-01  10.964  < 2e-16 ***
## age_imp                 -1.050e-02  1.709e-03  -6.146 7.94e-10 ***
## job_levelune_mgmt       -5.313e-01  7.480e-02  -7.103 1.22e-12 ***
## job_levelbl_en_hm_se    -7.050e-01  7.180e-02  -9.819  < 2e-16 ***
## job_levelad_sel_te_unk  -4.409e-01  7.003e-02  -6.296 3.05e-10 ***
## marital_levelSi_Di       1.792e-01  3.893e-02   4.602 4.18e-06 ***
## edu_levelter_unk         2.083e-01  4.452e-02   4.678 2.90e-06 ***
## defaultyes              -2.771e-01  1.619e-01  -1.712  0.08695 .  
## balance_imp              9.738e-05  9.221e-06  10.561  < 2e-16 ***
## housingyes              -7.666e-01  3.945e-02 -19.435  < 2e-16 ***
## loanyes                 -4.643e-01  5.540e-02  -8.380  < 2e-16 ***
## contacttelephone        -1.124e-01  7.425e-02  -1.514  0.13015    
## contactunknown          -1.049e+00  5.246e-02 -20.000  < 2e-16 ***
## day                     -5.683e-03  2.008e-03  -2.830  0.00465 ** 
## month_levelMar_Se_Oc_De  2.007e+00  7.397e-02  27.131  < 2e-16 ***
## month_levelFe_Ap         8.579e-01  5.027e-02  17.066  < 2e-16 ***
## duration_imp             5.901e-03  8.830e-05  66.825  < 2e-16 ***
## campaign_imp            -1.122e-01  9.931e-03 -11.295  < 2e-16 ***
## pdays_imp                4.114e-04  2.312e-04   1.779  0.07516 .  
## previous_imp             2.451e-02  2.463e-02   0.995  0.31961    
## poutcome_levelrest      -2.545e+00  9.805e-02 -25.958  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31338  on 22605  degrees of freedom
## Residual deviance: 19088  on 22585  degrees of freedom
## AIC: 19130
## 
## Number of Fisher Scoring iterations: 5

This shows that loan default, contact: telephone, pdays and previous are statistically not significant in predicting the outcome.

This logistic regression performed on the train data is used to predict the test data as given below

#make predictions with Balanced data
pred.glm.rose <- predict.glm(glm.rose, newdata = bank.test, type = "response")

INFERENCES FROM LOGISTIC REGRESSION OF BALANCED DATA

# Confusion Matrix
confusion.matrix(bank.test$ObsY, pred.glm.rose)
##     obs
## pred     0    1
##    0 16642  455
##    1  3319 2189
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, pred.glm.rose)

Confusion Matrix shows that sensitivity, specificity and accuracy are greater than 82% and balanced. Moreover this prediction is for threshold 0.5 while the earlier model has the optimum threshold of 0.1

#AUC ROSE
myROCS <-roc(bank.test$ObsY, pred.glm.rose)
myROCS
## 
## Call:
## roc.default(response = bank.test$ObsY, predictor = pred.glm.rose)
## 
## Data: pred.glm.rose in 19961 controls (bank.test$ObsY 0) < 2644 cases (bank.test$ObsY 1).
## Area under the curve: 0.908
plot(myROCS)

AUC remains same of 90.8% when compared to the earlier model

# look at TPR (Sensitivity) and TNR (Specificity) distribution over threshold for Original data
matplot(data.frame(myROCS$sensitivities, myROCS$specificities), x = myROCS$thresholds, type='l', xlab = 'threshold', ylab='TPR, TNR')
legend('bottomright', legend=c('TPR', 'TNR'), lty=1:2, col=1:2)

The optimum threshold is 0.5 since that data is perfectly balanced. Customers with predicted probabilities greater than 0.5 will subscribe to the term deposit.

DECISION TREE FOR STANDARDIZED BALANCED DATA Since this is a classification problem, this can also be addressed through decision tree. Here I took the standardized and balanced dataset (ROSE) to create decision tree. Decision tree is created based on Recursive Partitioning algorithm (rpart).

# Decision Tree for standardised data with ROSE Balancing
library(rpart)
tree.rose <- rpart(y ~., data = data_rose)
tree.rose
## n= 22606 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 22606 11234 no (0.5030523 0.4969477)  
##    2) duration_imp< 299.0133 12482  3815 no (0.6943599 0.3056401)  
##      4) poutcome_level=rest 11273  2729 no (0.7579171 0.2420829)  
##        8) month_level=Ja_Ma_Ju_Jul_Au_No 8705  1251 no (0.8562895 0.1437105) *
##        9) month_level=Mar_Se_Oc_De,Fe_Ap 2568  1090 yes (0.4244548 0.5755452)  
##         18) housing=yes 972   353 no (0.6368313 0.3631687) *
##         19) housing=no 1596   471 yes (0.2951128 0.7048872) *
##      5) poutcome_level=success 1209   123 yes (0.1017370 0.8982630) *
##    3) duration_imp>=299.0133 10124  2705 yes (0.2671869 0.7328131)  
##      6) duration_imp< 511.7324 4562  1812 yes (0.3971942 0.6028058)  
##       12) contact=unknown 709   147 no (0.7926657 0.2073343) *
##       13) contact=cellular,telephone 3853  1250 yes (0.3244225 0.6755775) *
##      7) duration_imp>=511.7324 5562   893 yes (0.1605538 0.8394462) *
library(rpart.plot)
rpart.plot(tree.rose, nn = TRUE)

As we can infer from the above, the duration of the call is the root node with probability of not subscribing to the term deposit is 50%. If the duration of the call is greater than 299 seconds, the probability of positive outcome is 73% (45% of the total population). Within this, if the duration if greater than 512 seconds, the probability of positive outcome is 84% (25% of the overall population). If the duration of the call is less than 512 seconds, the probability of positive outcome is 60% (20% of the total population). Within this, if the contact mode is cellular or telephone (not unknown), the probability of positive outcome is 68% (17% of the population). If the duration of the call is less than 299 seconds and if the probability of previous outcome is success, the probability of positive outcome is 90% (5% of the population). If the duration if less than 299 seconds and if the previous outcome is failure or unknown and if the month of contact is Feb, Apr, May, Sep, Oct, Dec, the probability of positive outcome is 58% (11% of the population). Within this, if the person is not having housing loan, then the probability of positive outcome is 70% (7% of the population). This decision tree probabilities from train data is used to predict the test data.

#Decision Tree Prediction
pred.tree.rose <- predict(tree.rose, newdata = bank.test)
pred.tree.rose
#Confusion Matrix
library(SDMTools)
confusion.matrix(bank.test$ObsY, pred.tree.rose[,2])
##     obs
## pred     0    1
##    0 15393  396
##    1  4568 2248
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, pred.tree.rose[,2])

The confusion matrix for decision tree shows that True positive (and Sensitivity) is increased while the True negative (and Specificity) is decreased. Accuracy is also reduced to 81% when compared to Logistic Regression. Cohen’s Kappa is also lesser when compared to Logistic Regression.

#ROC Curve
library(pROC)
myROCD <- roc(bank.test$y, pred.tree.rose[,2])
myROCD
## 
## Call:
## roc.default(response = bank.test$y, predictor = pred.tree.rose[,     2])
## 
## Data: pred.tree.rose[, 2] in 19961 controls (bank.test$y no) < 2644 cases (bank.test$y yes).
## Area under the curve: 0.8448
plot(myROCD)

AUC has been decreased a lot to 84.48% when compared to the Logistic Regression.

# look at TPR (Sensitivity) and TNR (Specificity) distribution over threshold
matplot(data.frame(myROCD$sensitivities, myROCD$specificities), x = myROCD$thresholds, type='l', xlab = 'threshold', ylab='TPR, TNR')
legend('bottomright', legend=c('TPR', 'TNR'), lty=1:2, col=1:2)

confusion.matrix(bank.test$ObsY, pred.tree.rose[,2], threshold = 0.55)
##     obs
## pred     0    1
##    0 15393  396
##    1  4568 2248
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, pred.tree.rose[,2],0.55)

The optimum threshold is found to be 0.55. The specificity still remains less than 80%.

INFERENCES ON THE MODEL: Per the extensive data analysis done using logistic regression models (with structured data, with original data, with structured and balanced data) and decision tree model (with structured and balanced data), it is inferred that the best model among these in doing the prediction is Logistic Regression with Structured and Balanced Data. The Accuracy of the model is 83% with Sensitivity of 82.79%, Specificity of 83.37% and with threshold of 0.5. Cohen’s Kappa of 0.45 shows that the model has moderate fit. AUC is good at 90.8%