The primary function of bank is to accept deposits for the purpose of lending. They mobilize capital by offering interests at specific rates and lend the money to individuals, institutions based on their profile and requirements at a higher interest rate thereby making profit. Banks normally promote their products (Services) through direct marketing campaign. The data I analysed is from a Portugal based bank which did direct marketing campaign to sell a term deposit product by making phone calls. The data has been generated based on the campaign targeting the potential individuals in year 2008 to 2010.
The business objective for this project is to identify the target customers who will subscribe to the newly launched term deposit plan in the Portuguese Bank by analysing few indicators which includes the client’s data available in the bank along with few more constructs on the marketing campaign. The Objectives are broadly classified into two: 1) Determine the variables that are statistically significant and thereby influencing the purchase of term deposit 2) Determine the probability of each customer who will subscribe to the term deposit. The success criteria would be the “Accuracy with which the potential customers are identified for the term deposit subscription at an early stage without the corresponding risk of incorrectly tagging the non-subscribers as potential subscribers and spending the time and money on them”.
rm(list=ls())
getwd()
## [1] "C:/PGPMFX/Term5/BA/Project"
setwd("C:/PGPMFX/Term5/BA/Project")
bank <-read.csv("bank-full.csv", sep = ";")
library(psych)
headTail(bank)
str(bank)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
The data contains 45211 records with 16 independent variables and 1 dependent variable. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (or not) subscribed. Data is mix of Continuous and Categorical variables including demographic details. The predictor variable is probability (0 to 1) value. This dataset has no missed out values and hence no imputation is necessary. Moreover, there is no need to code as the variable types are already correct. The dependent variable - term deposit subscription alone was recoded to 0 and 1 binary instead of two factors - yes and no. The Null Hypothesis H0 is that there is no significant relationship between the term deposit subscription and the remaining 16 variables. Alternate Hypothesis, H1 is that there is significant relationship between the dependent and independent variables.
Bar Charts:
# Grouped Bar Plot
barplot(table(bank$y, bank$education), main="Distribution of Education vs Deposit",
xlab="Education", col=c("darkblue","red"),
legend = rownames(table(bank$y, bank$education)), beside=TRUE)
The above chart of Education Vs Deposit shows that clients with secondary education is high but their subscription to term deposit is not so high.
barplot(table(bank$y, bank$marital), main="Distribution of Marital Status vs Deposit",
xlab="Marital Status", col=c("darkblue","red"),
legend = rownames(table(bank$y, bank$marital)), beside=TRUE)
The above chart of marital status vs Deposit shows that the large number of clients are married. In absolute terms, their subscription is high. But in relative terms, the clients who are single are subscribed higher.
Pie Chart
# Pie Chart of Term Deposit Subscription
library(plotrix)
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:psych':
##
## rescale
tab_sub <- table(bank$y)
pct <- round(tab_sub/sum(tab_sub)*100)
#lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(pct,"%",sep="") # ad % to labels
pie3D(tab_sub,labels=lbls,explode=0.5,
main="Pie Chart of Term Deposit Subscription ")
The above Pie chart shows that 12% of the clients are subscribed and 88% of the clients are not subscribed to term deposit. The data is imbalanced. The analysis were done with and without imbalanced data.
Histograms Histograms are created for continuous data. The outliers are truncated to understand the data better. Age seems to be normally distributed while the other continuous variables like Bank Balance, no. of contacts made for a customer in present campaign, duration of the call etc. are not normally distributed. There are outliers in these variables.
hist(bank$age, col = "green")
hist(bank$balance, col = "blue", xlim = c(0,20000))
hist(bank$campaign, col = "cyan", xlim = c(0,30))
hist(bank$day, col = "magenta")
hist(bank$duration, col = "violet", xlim = c(0,2000))
hist(bank$pdays, col = "dark green", xlim = c(0,400))
hist(bank$previous, col = "dark blue", xlim = c(0,50))
Box Plots Box plots of continuous variables are created with respect to the subscription. These give an understanding of the outliers present in the data.
boxplot(age~y, data=bank, notch=TRUE,
col=(c("gold","darkgreen")),
main="Age and Subscription", xlab="Subscription")
boxplot(balance~y, data=bank, notch=TRUE,
col=(c("gold","blue")),
main="Balance and Subscription", xlab="Subscription")
boxplot(campaign~y, data=bank, notch=TRUE,
col=(c("gold","cyan")),
main="Campaign and Subscription", xlab="Subscription")
boxplot(day~y, data=bank, notch=FALSE,
col=(c("gold","magenta")),
main="Day and Subscription", xlab="Subscription")
boxplot(duration~y, data=bank, notch=TRUE,
col=(c("gold","violet")),
main="Duration and Subscription", xlab="Subscription")
boxplot(pdays~y, data=bank, notch=FALSE,
col=(c("gold","dark green")),
main="Days Passed and Subscription", xlab="Subscription")
boxplot(previous~y, data=bank, notch=FALSE,
col=(c("gold","dark blue")),
main="Previous Campaign and Subscription", xlab="Subscription")
Descriptive Statistics
#Descriptive Stats Summary
library(dlookr)
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:psych':
##
## describe
## The following object is masked from 'package:base':
##
## transform
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
describe(bank)
Descriptive Statistics shows the Skewness present in the dataset. To remove the outliers and to make the data to be normally distributed, further analysis carried out. Below graphs shows how the data can be transformed and which transformation will work.
plot_normality(bank, balance)
## Warning in log(x): NaNs produced
## Warning in sqrt(x): NaNs produced
plot_normality(bank, day)
plot_normality(bank, duration)
plot_normality(bank,campaign)
plot_normality(bank, pdays)
## Warning in log(x): NaNs produced
## Warning in log(x): NaNs produced
plot_normality(bank, previous)
plot_correlate(bank)
Correlation of the continuous variables shows that none of the variables are strongly correlated with respect to the others. pdays and previous have weak correlation.
Plot Outliers To further analyse the outliers, outliers plot were created. The objective of these outliers plot is to analyse whether by “capping” the bottom 5 percentile and top 95 percentile will make the distribution to be smoother since I don’t want to make all the data to be normally distributed because of the data nature.
bank %>%
plot_outlier(balance)
bank %>%
plot_outlier(day)
bank %>%
plot_outlier(duration)
bank %>%
plot_outlier(campaign)
bank %>%
plot_outlier(pdays)
bank %>%
plot_outlier(previous)
Data Standardization: As mentioned above, the outliers the bottom 5 percentile and top 95 percentile data were capped for some of the continuous variables as given below. The capping resulted in reduction in outliers. The capped data are binded to the master data as separate columns.
# Imputation of Outlier through 95% upper and 5% lower Capping
age_imp <- imputate_outlier(bank, age, method = "capping")
summary(age_imp)
## Impute outliers with capping
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 4.521100e+04 4.521100e+04
## na 0.000000e+00 0.000000e+00
## mean 4.093621e+01 4.074431e+01
## sd 1.061876e+01 1.010553e+01
## se_mean 4.994038e-02 4.752661e-02
## IQR 1.500000e+01 1.500000e+01
## skewness 6.848179e-01 4.039484e-01
## kurtosis 3.195704e-01 -7.281184e-01
## p00 1.800000e+01 1.800000e+01
## p01 2.300000e+01 2.300000e+01
## p05 2.700000e+01 2.700000e+01
## p10 2.900000e+01 2.900000e+01
## p20 3.200000e+01 3.200000e+01
## p25 3.300000e+01 3.300000e+01
## p30 3.400000e+01 3.400000e+01
## p40 3.600000e+01 3.600000e+01
## p50 3.900000e+01 3.900000e+01
## p60 4.200000e+01 4.200000e+01
## p70 4.600000e+01 4.600000e+01
## p75 4.800000e+01 4.800000e+01
## p80 5.100000e+01 5.100000e+01
## p90 5.600000e+01 5.600000e+01
## p95 5.900000e+01 5.900000e+01
## p99 7.100000e+01 6.300000e+01
## p100 9.500000e+01 7.000000e+01
plot(age_imp)
bank <-cbind(bank,age_imp)
balance_imp <- imputate_outlier(bank, balance, method = "capping")
summary(balance_imp)
## Impute outliers with capping
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 45211.000000 45211.000000
## na 0.000000 0.000000
## mean 1362.272058 1174.716949
## sd 3044.765829 1758.845268
## se_mean 14.319631 8.271906
## IQR 1356.000000 1356.000000
## skewness 8.360308 1.771347
## kurtosis 140.751547 2.019813
## p00 -8019.000000 -1944.000000
## p01 -627.000000 -616.900000
## p05 -172.000000 -172.000000
## p10 0.000000 0.000000
## p20 22.000000 22.000000
## p25 72.000000 72.000000
## p30 131.000000 131.000000
## p40 272.000000 272.000000
## p50 448.000000 448.000000
## p60 701.000000 701.000000
## p70 1126.000000 1126.000000
## p75 1428.000000 1428.000000
## p80 1859.000000 1859.000000
## p90 3574.000000 5768.000000
## p95 5768.000000 5768.000000
## p99 13164.900000 5768.000000
## p100 102127.000000 5768.000000
plot(balance_imp)
bank <-cbind(bank,balance_imp)
campaign_imp <- imputate_outlier(bank, campaign, method = "capping")
summary(campaign_imp)
## Impute outliers with capping
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 4.521100e+04 4.521100e+04
## na 0.000000e+00 0.000000e+00
## mean 2.763841e+00 2.527770e+00
## sd 3.098021e+00 1.947076e+00
## se_mean 1.457009e-02 9.157162e-03
## IQR 2.000000e+00 2.000000e+00
## skewness 4.898650e+00 1.588853e+00
## kurtosis 3.924965e+01 1.793558e+00
## p00 1.000000e+00 1.000000e+00
## p01 1.000000e+00 1.000000e+00
## p05 1.000000e+00 1.000000e+00
## p10 1.000000e+00 1.000000e+00
## p20 1.000000e+00 1.000000e+00
## p25 1.000000e+00 1.000000e+00
## p30 1.000000e+00 1.000000e+00
## p40 2.000000e+00 2.000000e+00
## p50 2.000000e+00 2.000000e+00
## p60 2.000000e+00 2.000000e+00
## p70 3.000000e+00 3.000000e+00
## p75 3.000000e+00 3.000000e+00
## p80 4.000000e+00 4.000000e+00
## p90 5.000000e+00 5.000000e+00
## p95 8.000000e+00 8.000000e+00
## p99 1.600000e+01 8.000000e+00
## p100 6.300000e+01 8.000000e+00
plot(campaign_imp)
bank <-cbind(bank,campaign_imp)
duration_imp <- imputate_outlier(bank, duration, method = "capping")
summary(duration_imp)
## Impute outliers with capping
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 45211.000000 4.521100e+04
## na 0.000000 0.000000e+00
## mean 258.163080 2.426671e+02
## sd 257.527812 1.957610e+02
## se_mean 1.211162 9.206703e-01
## IQR 216.000000 2.160000e+02
## skewness 3.144318 1.308020e+00
## kurtosis 18.153915 9.542390e-01
## p00 0.000000 0.000000e+00
## p01 11.000000 1.100000e+01
## p05 35.000000 3.500000e+01
## p10 58.000000 5.800000e+01
## p20 89.000000 8.900000e+01
## p25 103.000000 1.030000e+02
## p30 117.000000 1.170000e+02
## p40 147.000000 1.470000e+02
## p50 180.000000 1.800000e+02
## p60 223.000000 2.230000e+02
## p70 280.000000 2.800000e+02
## p75 319.000000 3.190000e+02
## p80 368.000000 3.680000e+02
## p90 548.000000 5.480000e+02
## p95 751.000000 7.510000e+02
## p99 1269.000000 7.510000e+02
## p100 4918.000000 7.510000e+02
plot(duration_imp)
bank <-cbind(bank,duration_imp)
pdays_imp <- imputate_outlier(bank, pdays, method = "capping")
summary(pdays_imp)
## Impute outliers with capping
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 45211.0000000 45211.0000000
## na 0.0000000 0.0000000
## mean 40.1978280 57.0771494
## sd 100.1287460 122.8654228
## se_mean 0.4709087 0.5778400
## IQR 0.0000000 0.0000000
## skewness 2.6157155 1.6428921
## kurtosis 6.9351952 0.6991253
## p00 -1.0000000 -1.0000000
## p01 -1.0000000 -1.0000000
## p05 -1.0000000 -1.0000000
## p10 -1.0000000 -1.0000000
## p20 -1.0000000 -1.0000000
## p25 -1.0000000 -1.0000000
## p30 -1.0000000 -1.0000000
## p40 -1.0000000 -1.0000000
## p50 -1.0000000 -1.0000000
## p60 -1.0000000 -1.0000000
## p70 -1.0000000 -1.0000000
## p75 -1.0000000 -1.0000000
## p80 -1.0000000 -1.0000000
## p90 185.0000000 317.0000000
## p95 317.0000000 317.0000000
## p99 370.0000000 317.0000000
## p100 871.0000000 317.0000000
plot(pdays_imp)
bank <-cbind(bank,pdays_imp)
previous_imp <- imputate_outlier(bank, previous, method = "capping")
summary(previous_imp)
## Impute outliers with capping
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 4.521100e+04 4.521100e+04
## na 0.000000e+00 0.000000e+00
## mean 5.803234e-01 5.478976e-01
## sd 2.303441e+00 1.159108e+00
## se_mean 1.083316e-02 5.451321e-03
## IQR 0.000000e+00 0.000000e+00
## skewness 4.184645e+01 1.642892e+00
## kurtosis 4.506861e+03 6.991253e-01
## p00 0.000000e+00 0.000000e+00
## p01 0.000000e+00 0.000000e+00
## p05 0.000000e+00 0.000000e+00
## p10 0.000000e+00 0.000000e+00
## p20 0.000000e+00 0.000000e+00
## p25 0.000000e+00 0.000000e+00
## p30 0.000000e+00 0.000000e+00
## p40 0.000000e+00 0.000000e+00
## p50 0.000000e+00 0.000000e+00
## p60 0.000000e+00 0.000000e+00
## p70 0.000000e+00 0.000000e+00
## p75 0.000000e+00 0.000000e+00
## p80 0.000000e+00 0.000000e+00
## p90 2.000000e+00 3.000000e+00
## p95 3.000000e+00 3.000000e+00
## p99 8.900000e+00 3.000000e+00
## p100 2.750000e+02 3.000000e+00
plot(previous_imp)
bank <-cbind(bank,previous_imp)
Box plots for Outliers minimized data: Box plots are again created with the capped data to see the distribution. Compared to the original data, these capped data distribution is good without affecting its distribution much.
#Box Plots - Imputed & Outliers minimized variables
boxplot(age_imp~y, data=bank, notch=TRUE,
col=(c("gold","darkgreen")),
main="Age and Subscription", xlab="Subscription")
boxplot(balance_imp~y, data=bank, notch=TRUE,
col=(c("gold","blue")),
main="Balance and Subscription", xlab="Subscription")
boxplot(campaign_imp~y, data=bank, notch=TRUE,
col=(c("gold","cyan")),
main="Campaign and Subscription", xlab="Subscription")
boxplot(duration_imp~y, data=bank, notch=TRUE,
col=(c("gold","violet")),
main="Duration and Subscription", xlab="Subscription")
boxplot(pdays_imp~y, data=bank, notch=FALSE,
col=(c("gold","dark green")),
main="Days Passed and Subscription", xlab="Subscription")
boxplot(previous_imp~y, data=bank, notch=FALSE,
col=(c("gold","dark blue")),
main="Previous Campaign and Subscription", xlab="Subscription")
Category reduction in variables - Analysis by percentage and absolute numbers There are categorical variables which has more levels like Job, marital status etc. These needs to be analysed and grouped together based on the similarity of the percentage of events occurring when compared to the total of events and non-events.
table(bank$job, bank$y)
##
## no yes
## admin. 4540 631
## blue-collar 9024 708
## entrepreneur 1364 123
## housemaid 1131 109
## management 8157 1301
## retired 1748 516
## self-employed 1392 187
## services 3785 369
## student 669 269
## technician 6757 840
## unemployed 1101 202
## unknown 254 34
round(prop.table(table(bank$job, bank$y),1),2)
##
## no yes
## admin. 0.88 0.12
## blue-collar 0.93 0.07
## entrepreneur 0.92 0.08
## housemaid 0.91 0.09
## management 0.86 0.14
## retired 0.77 0.23
## self-employed 0.88 0.12
## services 0.91 0.09
## student 0.71 0.29
## technician 0.89 0.11
## unemployed 0.84 0.16
## unknown 0.88 0.12
table(bank$education, bank$y)
##
## no yes
## primary 6260 591
## secondary 20752 2450
## tertiary 11305 1996
## unknown 1605 252
round(prop.table(table(bank$education, bank$y),1),2)
##
## no yes
## primary 0.91 0.09
## secondary 0.89 0.11
## tertiary 0.85 0.15
## unknown 0.86 0.14
table(bank$month, bank$y)
##
## no yes
## apr 2355 577
## aug 5559 688
## dec 114 100
## feb 2208 441
## jan 1261 142
## jul 6268 627
## jun 4795 546
## mar 229 248
## may 12841 925
## nov 3567 403
## oct 415 323
## sep 310 269
round(prop.table(table(bank$month, bank$y),1),2)
##
## no yes
## apr 0.80 0.20
## aug 0.89 0.11
## dec 0.53 0.47
## feb 0.83 0.17
## jan 0.90 0.10
## jul 0.91 0.09
## jun 0.90 0.10
## mar 0.48 0.52
## may 0.93 0.07
## nov 0.90 0.10
## oct 0.56 0.44
## sep 0.54 0.46
table(bank$marital, bank$y)
##
## no yes
## divorced 4585 622
## married 24459 2755
## single 10878 1912
round(prop.table(table(bank$marital, bank$y),1),2)
##
## no yes
## divorced 0.88 0.12
## married 0.90 0.10
## single 0.85 0.15
table(bank$poutcome, bank$y)
##
## no yes
## failure 4283 618
## other 1533 307
## success 533 978
## unknown 33573 3386
round(prop.table(table(bank$poutcome, bank$y),1),2)
##
## no yes
## failure 0.87 0.13
## other 0.83 0.17
## success 0.35 0.65
## unknown 0.91 0.09
table(bank$contact, bank$y)
##
## no yes
## cellular 24916 4369
## telephone 2516 390
## unknown 12490 530
round(prop.table(table(bank$contact, bank$y),1),2)
##
## no yes
## cellular 0.85 0.15
## telephone 0.87 0.13
## unknown 0.96 0.04
#Category reduction in variables by similar percentage of events using Rockchalk library
library(rockchalk)
##
## Attaching package: 'rockchalk'
## The following object is masked from 'package:dplyr':
##
## summarize
#Combine Job levels to 4
job_level1 <-combineLevels(bank$job,c("student","retired"),newLabel=c("st_re"))
job_level2 <-combineLevels(job_level1,c("unemployed","management"),newLabel=c("une_mgmt"))
job_level3 <-combineLevels(job_level2,c("blue-collar","entrepreneur","housemaid",
"services"),newLabel=c("bl_en_hm_se"))
job_level <-combineLevels(job_level3,c("admin.","self-employed","technician",
"unknown"),newLabel=c("ad_sel_te_unk"))
table(job_level)
bank <-cbind(bank,job_level)
#Combine month levels to 3
month_level1 <-combineLevels(bank$month,c("aug","jan","jun","jul","may","nov"),newLabel=c("Ja_Ma_Ju_Jul_Au_No"))
month_level2 <-combineLevels(month_level1,c("mar","sep","oct","dec"),newLabel=c("Mar_Se_Oc_De"))
month_level <-combineLevels(month_level2,c("apr","feb"),newLabel=c("Fe_Ap"))
table(month_level)
bank <-cbind(bank,month_level)
#Combine education levels to 2
edu_level1 <-combineLevels(bank$education,c("primary","secondary"),newLabel=c("Pri_Sec"))
edu_level <-combineLevels(edu_level1,c("tertiary","unknown"),newLabel=c("ter_unk"))
table(edu_level)
bank <-cbind(bank,edu_level)
#Combine marital levels to 2
marital_level <-combineLevels(bank$marital,c("single","divorced"),newLabel=c("Si_Di"))
table(marital_level)
bank <-cbind(bank,marital_level)
#Combine poutcome levels to 2
poutcome_level <-combineLevels(bank$poutcome,c("failure","other","unknown"),newLabel=c("rest"))
table(poutcome_level)
bank <-cbind(bank,poutcome_level)
#Correlation of Continuous Variables - Before Outlier reduction
cor(bank[,c(6,12,13,14,15)])
## balance duration campaign pdays previous
## balance 1.000000000 0.021560380 -0.01457828 0.003435322 0.016673637
## duration 0.021560380 1.000000000 -0.08456950 -0.001564770 0.001203057
## campaign -0.014578279 -0.084569503 1.00000000 -0.088627668 -0.032855290
## pdays 0.003435322 -0.001564770 -0.08862767 1.000000000 0.454819635
## previous 0.016673637 0.001203057 -0.03285529 0.454819635 1.000000000
#Correlation of Continuous Variables - After Outlier reduction
cor(bank[,c(19,20,21,22,23)])
## balance_imp campaign_imp duration_imp pdays_imp
## balance_imp 1.00000000 -0.02101371 0.03844600 0.05067016
## campaign_imp -0.02101371 1.00000000 -0.09887499 -0.11572108
## duration_imp 0.03844600 -0.09887499 1.00000000 0.01319840
## pdays_imp 0.05067016 -0.11572108 0.01319840 1.00000000
## previous_imp 0.05067016 -0.11572108 0.01319840 1.00000000
## previous_imp
## balance_imp 0.05067016
## campaign_imp -0.11572108
## duration_imp 0.01319840
## pdays_imp 1.00000000
## previous_imp 1.00000000
Correlation of Continuous variables after the outlier reduction shows that pdays and previous are correlated and hence one can be removed. Here I removed pdays.
Data Partition: Data has been separated to Training and Test dataset with probability of 0.5 (exactly split into two) based on the dependent variable.
#Data Classification to Train and Test
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
set.seed(1234)
train <- createDataPartition(bank$y, list=FALSE) #create 2 set of data by Random sampling
train
bank.train <- bank[train, ]
bank.test <- bank[-train, ] #-train is complement to train data
#Convert Yes No to 1 and 0 respectively
bank.train$ObsY <- ifelse (bank.train$y == "yes", 1,0)
bank.test$ObsY <- ifelse (bank.test$y == "yes", 1,0)
bank.test$ObsY <- as.integer(bank.test$ObsY)
bank.train$ObsY <- as.integer(bank.train$ObsY)
headTail(bank.train)
headTail(bank.test)
summary(bank.train)
## age job marital education
## Min. :18.00 blue-collar:4858 divorced: 2633 primary : 3408
## 1st Qu.:33.00 management :4719 married :13568 secondary:11625
## Median :39.00 technician :3795 single : 6405 tertiary : 6686
## Mean :40.95 admin. :2580 unknown : 887
## 3rd Qu.:48.00 services :2128
## Max. :95.00 retired :1107
## (Other) :3419
## default balance housing loan contact
## no :22210 Min. : -6847 no :10041 no :19013 cellular :14616
## yes: 396 1st Qu.: 76 yes:12565 yes: 3593 telephone: 1466
## Median : 453 unknown : 6524
## Mean : 1360
## 3rd Qu.: 1430
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :6857 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul :3470 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug :3098 Median : 180.0 Median : 2.000
## Mean :15.84 jun :2740 Mean : 257.2 Mean : 2.789
## 3rd Qu.:21.00 nov :1953 3rd Qu.: 318.0 3rd Qu.: 3.000
## Max. :31.00 apr :1412 Max. :3881.0 Max. :63.000
## (Other):3076
## pdays previous poutcome y
## Min. : -1.00 Min. : 0.0000 failure: 2432 no :19961
## 1st Qu.: -1.00 1st Qu.: 0.0000 other : 909 yes: 2645
## Median : -1.00 Median : 0.0000 success: 784
## Mean : 39.97 Mean : 0.5834 unknown:18481
## 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :871.00 Max. :58.0000
##
## age_imp balance_imp campaign_imp duration_imp
## Min. :18.00 Min. :-1941 Min. :1.000 Min. : 0.0
## 1st Qu.:33.00 1st Qu.: 76 1st Qu.:1.000 1st Qu.:103.0
## Median :39.00 Median : 453 Median :2.000 Median :180.0
## Mean :40.75 Mean : 1172 Mean :2.543 Mean :241.9
## 3rd Qu.:48.00 3rd Qu.: 1430 3rd Qu.:3.000 3rd Qu.:318.0
## Max. :70.00 Max. : 5768 Max. :8.000 Max. :751.0
##
## pdays_imp previous_imp job_level
## Min. : -1.0 Min. :0.0000 st_re :1588
## 1st Qu.: -1.0 1st Qu.:0.0000 une_mgmt :5385
## Median : -1.0 Median :0.0000 bl_en_hm_se :8333
## Mean : 57.1 Mean :0.5481 ad_sel_te_unk:7300
## 3rd Qu.: -1.0 3rd Qu.:0.0000
## Max. :317.0 Max. :3.0000
##
## month_level edu_level marital_level
## Ja_Ma_Ju_Jul_Au_No:18846 Pri_Sec:15033 married:13568
## Mar_Se_Oc_De : 1029 ter_unk: 7573 Si_Di : 9038
## Fe_Ap : 2731
##
##
##
##
## poutcome_level ObsY
## success: 784 Min. :0.000
## rest :21822 1st Qu.:0.000
## Median :0.000
## Mean :0.117
## 3rd Qu.:0.000
## Max. :1.000
##
summary(bank.test)
## age job marital education
## Min. :18.00 blue-collar:4874 divorced: 2574 primary : 3443
## 1st Qu.:33.00 management :4739 married :13646 secondary:11577
## Median :39.00 technician :3802 single : 6385 tertiary : 6615
## Mean :40.93 admin. :2591 unknown : 970
## 3rd Qu.:48.00 services :2026
## Max. :95.00 retired :1157
## (Other) :3416
## default balance housing loan contact
## no :22186 Min. :-8019 no :10040 no :18954 cellular :14669
## yes: 419 1st Qu.: 69 yes:12565 yes: 3651 telephone: 1440
## Median : 444 unknown : 6496
## Mean : 1365
## 3rd Qu.: 1423
## Max. :98417
##
## day month duration campaign
## Min. : 1.00 may :6909 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul :3425 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug :3149 Median : 180.0 Median : 2.000
## Mean :15.77 jun :2601 Mean : 259.2 Mean : 2.739
## 3rd Qu.:21.00 nov :2017 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr :1520 Max. :4918.0 Max. :58.000
## (Other):2984
## pdays previous poutcome y
## Min. : -1.00 Min. : 0.0000 failure: 2469 no :19961
## 1st Qu.: -1.00 1st Qu.: 0.0000 other : 931 yes: 2644
## Median : -1.00 Median : 0.0000 success: 727
## Mean : 40.42 Mean : 0.5772 unknown:18478
## 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :842.00 Max. :275.0000
##
## age_imp balance_imp campaign_imp duration_imp
## Min. :18.00 Min. :-1944 Min. :1.000 Min. : 0.0
## 1st Qu.:33.00 1st Qu.: 69 1st Qu.:1.000 1st Qu.:103.0
## Median :39.00 Median : 444 Median :2.000 Median :180.0
## Mean :40.74 Mean : 1178 Mean :2.512 Mean :243.4
## 3rd Qu.:48.00 3rd Qu.: 1423 3rd Qu.:3.000 3rd Qu.:319.0
## Max. :70.00 Max. : 5768 Max. :8.000 Max. :751.0
##
## pdays_imp previous_imp job_level
## Min. : -1.00 Min. :0.0000 st_re :1614
## 1st Qu.: -1.00 1st Qu.:0.0000 une_mgmt :5376
## Median : -1.00 Median :0.0000 bl_en_hm_se :8280
## Mean : 57.06 Mean :0.5477 ad_sel_te_unk:7335
## 3rd Qu.: -1.00 3rd Qu.:0.0000
## Max. :317.00 Max. :3.0000
##
## month_level edu_level marital_level
## Ja_Ma_Ju_Jul_Au_No:18776 Pri_Sec:15020 married:13646
## Mar_Se_Oc_De : 979 ter_unk: 7585 Si_Di : 8959
## Fe_Ap : 2850
##
##
##
##
## poutcome_level ObsY
## success: 727 Min. :0.000
## rest :21878 1st Qu.:0.000
## Median :0.000
## Mean :0.117
## 3rd Qu.:0.000
## Max. :1.000
##
Logistic regression with original data (unstandardized) is performed to analyse the significant variable. For every n level of categorical variable, R automatically creates n-1 dummy variable.
#Logistic Regression - Original Data
bank.glm <- glm(y ~ age + job + marital + education + default
+ balance + housing + loan + contact + day
+ month + duration + campaign + pdays + previous
+ poutcome, data = bank.train, family = "binomial")
summary(bank.glm)
##
## Call:
## glm(formula = y ~ age + job + marital + education + default +
## balance + housing + loan + contact + day + month + duration +
## campaign + pdays + previous + poutcome, family = "binomial",
## data = bank.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9144 -0.3752 -0.2510 -0.1485 3.5216
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.318e+00 2.604e-01 -8.900 < 2e-16 ***
## age -4.333e-03 3.137e-03 -1.381 0.16720
## jobblue-collar -4.108e-01 1.050e-01 -3.911 9.20e-05 ***
## jobentrepreneur -3.672e-01 1.818e-01 -2.020 0.04339 *
## jobhousemaid -5.240e-01 1.894e-01 -2.766 0.00567 **
## jobmanagement -1.471e-01 1.039e-01 -1.416 0.15672
## jobretired 2.675e-01 1.411e-01 1.897 0.05789 .
## jobself-employed -7.208e-02 1.549e-01 -0.465 0.64176
## jobservices -6.474e-02 1.147e-01 -0.565 0.57240
## jobstudent 4.427e-01 1.527e-01 2.899 0.00374 **
## jobtechnician -1.139e-01 9.802e-02 -1.162 0.24541
## jobunemployed -1.146e-01 1.527e-01 -0.750 0.45309
## jobunknown -6.853e-01 3.733e-01 -1.836 0.06639 .
## maritalmarried -1.551e-01 8.293e-02 -1.870 0.06153 .
## maritalsingle -3.267e-02 9.548e-02 -0.342 0.73223
## educationsecondary 8.450e-02 9.091e-02 0.930 0.35260
## educationtertiary 2.601e-01 1.053e-01 2.470 0.01352 *
## educationunknown 2.087e-02 1.531e-01 0.136 0.89153
## defaultyes -1.249e-01 2.440e-01 -0.512 0.60874
## balance 1.745e-05 7.044e-06 2.477 0.01326 *
## housingyes -6.944e-01 6.203e-02 -11.195 < 2e-16 ***
## loanyes -4.447e-01 8.537e-02 -5.209 1.90e-07 ***
## contacttelephone -2.000e-01 1.070e-01 -1.870 0.06145 .
## contactunknown -1.668e+00 1.039e-01 -16.050 < 2e-16 ***
## day 1.519e-02 3.509e-03 4.330 1.49e-05 ***
## monthaug -8.444e-01 1.118e-01 -7.550 4.36e-14 ***
## monthdec 4.133e-01 2.491e-01 1.659 0.09715 .
## monthfeb -7.013e-02 1.251e-01 -0.561 0.57492
## monthjan -1.450e+00 1.706e-01 -8.504 < 2e-16 ***
## monthjul -9.436e-01 1.087e-01 -8.678 < 2e-16 ***
## monthjun 3.270e-01 1.321e-01 2.475 0.01333 *
## monthmar 1.446e+00 1.686e-01 8.577 < 2e-16 ***
## monthmay -4.499e-01 1.017e-01 -4.425 9.65e-06 ***
## monthnov -1.064e+00 1.207e-01 -8.811 < 2e-16 ***
## monthoct 7.579e-01 1.520e-01 4.986 6.15e-07 ***
## monthsep 7.523e-01 1.667e-01 4.512 6.42e-06 ***
## duration 4.155e-03 9.162e-05 45.352 < 2e-16 ***
## campaign -1.041e-01 1.485e-02 -7.011 2.36e-12 ***
## pdays -2.230e-04 4.366e-04 -0.511 0.60954
## previous 3.304e-02 1.306e-02 2.530 0.01139 *
## poutcomeother 2.277e-01 1.283e-01 1.774 0.07609 .
## poutcomesuccess 2.370e+00 1.159e-01 20.445 < 2e-16 ***
## poutcomeunknown 6.147e-02 1.363e-01 0.451 0.65189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16318 on 22605 degrees of freedom
## Residual deviance: 10794 on 22563 degrees of freedom
## AIC: 10880
##
## Number of Fisher Scoring iterations: 6
confint(bank.glm)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -2.829593e+00 -1.808570e+00
## age -1.049154e-02 1.805619e-03
## jobblue-collar -6.166407e-01 -2.047687e-01
## jobentrepreneur -7.314016e-01 -1.803935e-02
## jobhousemaid -9.033648e-01 -1.600604e-01
## jobmanagement -3.501340e-01 5.719013e-02
## jobretired -9.623689e-03 5.434802e-01
## jobself-employed -3.796936e-01 2.280504e-01
## jobservices -2.903481e-01 1.593643e-01
## jobstudent 1.418719e-01 7.406701e-01
## jobtechnician -3.055539e-01 7.877701e-02
## jobunemployed -4.172646e-01 1.816343e-01
## jobunknown -1.464055e+00 6.935223e-03
## maritalmarried -3.162833e-01 8.892838e-03
## maritalsingle -2.189260e-01 1.554259e-01
## educationsecondary -9.250782e-02 2.639275e-01
## educationtertiary 5.453223e-02 4.673857e-01
## educationunknown -2.823900e-01 3.179040e-01
## defaultyes -6.305556e-01 3.295936e-01
## balance 3.502218e-06 3.123946e-05
## housingyes -8.163315e-01 -5.731356e-01
## loanyes -6.141616e-01 -2.793998e-01
## contacttelephone -4.124924e-01 6.925721e-03
## contactunknown -1.873302e+00 -1.465797e+00
## day 8.318117e-03 2.207532e-02
## monthaug -1.063598e+00 -6.250671e-01
## monthdec -7.841469e-02 8.994774e-01
## monthfeb -3.156536e-01 1.746693e-01
## monthjan -1.790998e+00 -1.121816e+00
## monthjul -1.156907e+00 -7.305166e-01
## monthjun 6.799700e-02 5.860070e-01
## monthmar 1.115709e+00 1.777115e+00
## monthmay -6.487159e-01 -2.500866e-01
## monthnov -1.301809e+00 -8.283232e-01
## monthoct 4.591538e-01 1.055172e+00
## monthsep 4.249188e-01 1.078787e+00
## duration 3.976978e-03 4.336163e-03
## campaign -1.338818e-01 -7.567839e-02
## pdays -1.083564e-03 6.284931e-04
## previous 6.470268e-03 5.793068e-02
## poutcomeother -2.573233e-02 4.776252e-01
## poutcomesuccess 2.144159e+00 2.598659e+00
## poutcomeunknown -2.048900e-01 3.293991e-01
Logistic Regression shows, Job: Blue-collar, entrepreneur, housemaid, student; Education: Tertiary; Balance; housing loan; personal loan; contact: unknown; days; months: all except dec & feb; duration; campaign; previous; poutcome: success along with their reference variables are statistically significant in influencing the outcome of term deposit subscription. Their confidence intervals are not overlapping with 0 as well. The Estimate (Beta) shows the odds.
Logistic Regression with Standardized data (Outlier reduction and level standardization for categorical variables) is performed to analyse the influence of variables on the outcome.
#Logistic Regression - Standarized Data. poutcome removed due to its correlation with previous campaign
bank.glm1 <- glm(y ~ age_imp + job_level + marital_level + edu_level + default
+ balance_imp + housing + loan + contact + day
+ month_level + duration_imp + campaign_imp + previous_imp
+ poutcome_level, data = bank.train, family = "binomial")
summary(bank.glm1)
##
## Call:
## glm(formula = y ~ age_imp + job_level + marital_level + edu_level +
## default + balance_imp + housing + loan + contact + day +
## month_level + duration_imp + campaign_imp + previous_imp +
## poutcome_level, family = "binomial", data = bank.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2620 -0.3615 -0.2260 -0.1393 3.1544
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.320e-01 2.038e-01 -2.610 0.00905 **
## age_imp -1.142e-02 2.668e-03 -4.280 1.87e-05 ***
## job_levelune_mgmt -4.804e-01 9.946e-02 -4.830 1.37e-06 ***
## job_levelbl_en_hm_se -6.643e-01 9.745e-02 -6.816 9.35e-12 ***
## job_levelad_sel_te_unk -4.284e-01 9.329e-02 -4.592 4.40e-06 ***
## marital_levelSi_Di 1.646e-01 5.540e-02 2.971 0.00297 **
## edu_levelter_unk 1.439e-01 6.164e-02 2.335 0.01953 *
## defaultyes -2.255e-01 2.504e-01 -0.901 0.36776
## balance_imp 8.501e-05 1.364e-05 6.230 4.67e-10 ***
## housingyes -6.982e-01 5.604e-02 -12.458 < 2e-16 ***
## loanyes -4.345e-01 8.254e-02 -5.264 1.41e-07 ***
## contacttelephone -1.988e-01 1.057e-01 -1.880 0.06007 .
## contactunknown -9.439e-01 8.037e-02 -11.744 < 2e-16 ***
## day 1.692e-03 3.091e-03 0.547 0.58412
## month_levelMar_Se_Oc_De 1.688e+00 8.890e-02 18.987 < 2e-16 ***
## month_levelFe_Ap 6.635e-01 6.969e-02 9.521 < 2e-16 ***
## duration_imp 5.879e-03 1.171e-04 50.198 < 2e-16 ***
## campaign_imp -9.764e-02 1.631e-02 -5.985 2.17e-09 ***
## previous_imp 4.816e-02 2.348e-02 2.051 0.04028 *
## poutcome_levelrest -2.386e+00 1.066e-01 -22.379 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16318 on 22605 degrees of freedom
## Residual deviance: 10646 on 22586 degrees of freedom
## AIC: 10686
##
## Number of Fisher Scoring iterations: 6
confint(bank.glm1)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -0.9315878401 -0.1324864051
## age_imp -0.0166546744 -0.0061967706
## job_levelune_mgmt -0.6748022469 -0.2848617685
## job_levelbl_en_hm_se -0.8547156219 -0.4726372072
## job_levelad_sel_te_unk -0.6105211379 -0.2447579610
## marital_levelSi_Di 0.0559515719 0.2731330278
## edu_levelter_unk 0.0229260732 0.2645773111
## defaultyes -0.7420922272 0.2429016000
## balance_imp 0.0000581443 0.0001116388
## housingyes -0.8083197482 -0.5886022343
## loanyes -0.5981490412 -0.2744990594
## contacttelephone -0.4085118204 0.0061072290
## contactunknown -1.1030303157 -0.7878752115
## day -0.0043688521 0.0077492129
## month_levelMar_Se_Oc_De 1.5134700862 1.8619940750
## month_levelFe_Ap 0.5265487244 0.7997844527
## duration_imp 0.0056508636 0.0061100178
## campaign_imp -0.1299331340 -0.0659662839
## previous_imp 0.0019044272 0.0939686072
## poutcome_levelrest -2.5961609746 -2.1781275004
The analysis shows that age, job, education, balance, housing loan, personal loan, contact: unknown & cellular, months, duration, campaign, previous, poutcome became statistically significant variables in influencing the term deposit subscription.
Data Prediction has been carried out with respect to the original data and standardized dataset.
#Prediction - Original data
bank.test$predSub <- predict.glm(bank.glm, newdata = bank.test, type = "response")
headTail(bank.test)
#Prediction - Standardized data
bank.test$predSub1 <- predict.glm(bank.glm1, newdata = bank.test, type = "response")
headTail(bank.test)
INFERENCES FROM LOGISTIC REGRESSION OF IMBALANCED DATA
#Confusion Matrix - Original data
library(SDMTools)
##
## Attaching package: 'SDMTools'
## The following objects are masked from 'package:caret':
##
## sensitivity, specificity
confusion.matrix(bank.test$ObsY, bank.test$predSub, threshold = 0.5)
## obs
## pred 0 1
## 0 19491 1743
## 1 470 901
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, bank.test$predSub, threshold = 0.5)
#Confusion Matrix - Standarized data
library(SDMTools)
confusion.matrix(bank.test$ObsY, bank.test$predSub1, threshold = 0.5)
## obs
## pred 0 1
## 0 19396 1694
## 1 565 950
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, bank.test$predSub1, threshold = 0.5)
Confusion Matrix of Standardized data has better prediction of True Positive (including Sensitivity) while the Original data has better prediction of True Negative (Specificity). The accuracy of the model is better for Standardized dataset (66.55%) when compared to the original dataset (65.86%). Cohen’s Kappa which is the classification accuracy normalized by the imbalance of the classes in the data is little better for the Standardized data (0.406) when compared to the original data (0.401). Landis and Koch (1977) provide a way to characterize values. According to their scheme a value < 0 is indicating no agreement, 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1 as almost perfect agreement
#ROC Curve - Original data
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:SDMTools':
##
## auc
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
myROC <- roc(bank.test$y, bank.test$predSub)
myROC
##
## Call:
## roc.default(response = bank.test$y, predictor = bank.test$predSub)
##
## Data: bank.test$predSub in 19961 controls (bank.test$y no) < 2644 cases (bank.test$y yes).
## Area under the curve: 0.9059
plot(myROC)
#ROC Curve - Standarized data
library(pROC)
myROC1 <- roc(bank.test$y, bank.test$predSub1)
myROC1
##
## Call:
## roc.default(response = bank.test$y, predictor = bank.test$predSub1)
##
## Data: bank.test$predSub1 in 19961 controls (bank.test$y no) < 2644 cases (bank.test$y yes).
## Area under the curve: 0.9086
plot(myROC1)
AUC of Standardized data is better at 90.86% when compared to the original data 90.59%
Optimum Threshold: To find the optimal threshold, I created a plot from the ROC attributes - Sensitivity, Specificity and Threshold. The point where both Sensitivity and Specificity are intersecting, the corresponding threshold is the optimum.
# look at TPR (Sensitivity) and TNR (Specificity) distribution over threshold for Original data
matplot(data.frame(myROC$sensitivities, myROC$specificities), x = myROC$thresholds, type='l', xlab = 'threshold', ylab='TPR, TNR')
legend('bottomright', legend=c('TPR', 'TNR'), lty=1:2, col=1:2)
# Confusion matrix for Original data with Optimal threshold of 0.1
confusion.matrix(bank.test$ObsY, bank.test$predSub, threshold = 0.1)
## obs
## pred 0 1
## 0 16184 372
## 1 3777 2272
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, bank.test$predSub,0.10)
# look at TPR (Sensitivity) and TNR (Specificity) distribution over threshold for Standardized data
matplot(data.frame(myROC1$sensitivities, myROC1$specificities), x = myROC1$thresholds, type='l', xlab = 'threshold', ylab='TPR, TNR')
legend('bottomright', legend=c('TPR', 'TNR'), lty=1:2, col=1:2)
# Confusion matrix for Standardized data with Optimal threshold of 0.1
confusion.matrix(bank.test$ObsY, bank.test$predSub1, threshold = 0.1)
## obs
## pred 0 1
## 0 16127 374
## 1 3834 2270
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, bank.test$predSub1,0.1)
With both Standardized and Original datasets, the optimum threshold is 0.1 which means that if the probability is >0.1, we should conclude that the subscriber will subscribe to the term deposit. Sensitivity, Specificity and Accuracy are greater than 80%.
DATA BALANCING: As the dataset is imbalanced, we need to perform data balancing to improve the prediction. There are many methods available like Under Sampling, Over Sampling, SMOTE etc. For this model, I used Random Over Sampling method (ROSE) with Standardized data.
Iteration with Balancing through ROSE method for Standardized data
#Random Over Sampling (ROSE) for Original data and perform Logistic Regression
library(ROSE)
## Loaded ROSE 0.0-3
#ROSE
data_rose <- ROSE(y ~ age_imp + job_level + marital_level + edu_level + default
+ balance_imp + housing + loan + contact + day
+ month_level + duration_imp + campaign_imp + pdays_imp + previous_imp
+ poutcome_level, data = bank.train, seed = 1)$data
table(data_rose$y)
##
## no yes
## 11372 11234
As we can see above, ROSE made the minority positive outcome to be equal to majority negative outcome in the train data. This can be used to perform logistic regression and predict the test data.
LOGISTIC REGRESSION OF BALANCED STANDARDIZED DATA: Logistic regression with Standardized and balanced data is performed as given below
#build logistic model Balanced data
glm.rose <- glm(y ~ age_imp + job_level + marital_level + edu_level + default
+ balance_imp + housing + loan + contact + day
+ month_level + duration_imp + campaign_imp + pdays_imp + previous_imp
+ poutcome_level,data = data_rose, family = "binomial")
summary(glm.rose)
##
## Call:
## glm(formula = y ~ age_imp + job_level + marital_level + edu_level +
## default + balance_imp + housing + loan + contact + day +
## month_level + duration_imp + campaign_imp + pdays_imp + previous_imp +
## poutcome_level, family = "binomial", data = data_rose)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8615 -0.6247 -0.1497 0.6465 3.0901
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.675e+00 1.528e-01 10.964 < 2e-16 ***
## age_imp -1.050e-02 1.709e-03 -6.146 7.94e-10 ***
## job_levelune_mgmt -5.313e-01 7.480e-02 -7.103 1.22e-12 ***
## job_levelbl_en_hm_se -7.050e-01 7.180e-02 -9.819 < 2e-16 ***
## job_levelad_sel_te_unk -4.409e-01 7.003e-02 -6.296 3.05e-10 ***
## marital_levelSi_Di 1.792e-01 3.893e-02 4.602 4.18e-06 ***
## edu_levelter_unk 2.083e-01 4.452e-02 4.678 2.90e-06 ***
## defaultyes -2.771e-01 1.619e-01 -1.712 0.08695 .
## balance_imp 9.738e-05 9.221e-06 10.561 < 2e-16 ***
## housingyes -7.666e-01 3.945e-02 -19.435 < 2e-16 ***
## loanyes -4.643e-01 5.540e-02 -8.380 < 2e-16 ***
## contacttelephone -1.124e-01 7.425e-02 -1.514 0.13015
## contactunknown -1.049e+00 5.246e-02 -20.000 < 2e-16 ***
## day -5.683e-03 2.008e-03 -2.830 0.00465 **
## month_levelMar_Se_Oc_De 2.007e+00 7.397e-02 27.131 < 2e-16 ***
## month_levelFe_Ap 8.579e-01 5.027e-02 17.066 < 2e-16 ***
## duration_imp 5.901e-03 8.830e-05 66.825 < 2e-16 ***
## campaign_imp -1.122e-01 9.931e-03 -11.295 < 2e-16 ***
## pdays_imp 4.114e-04 2.312e-04 1.779 0.07516 .
## previous_imp 2.451e-02 2.463e-02 0.995 0.31961
## poutcome_levelrest -2.545e+00 9.805e-02 -25.958 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 31338 on 22605 degrees of freedom
## Residual deviance: 19088 on 22585 degrees of freedom
## AIC: 19130
##
## Number of Fisher Scoring iterations: 5
This shows that loan default, contact: telephone, pdays and previous are statistically not significant in predicting the outcome.
This logistic regression performed on the train data is used to predict the test data as given below
#make predictions with Balanced data
pred.glm.rose <- predict.glm(glm.rose, newdata = bank.test, type = "response")
INFERENCES FROM LOGISTIC REGRESSION OF BALANCED DATA
# Confusion Matrix
confusion.matrix(bank.test$ObsY, pred.glm.rose)
## obs
## pred 0 1
## 0 16642 455
## 1 3319 2189
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, pred.glm.rose)
Confusion Matrix shows that sensitivity, specificity and accuracy are greater than 82% and balanced. Moreover this prediction is for threshold 0.5 while the earlier model has the optimum threshold of 0.1
#AUC ROSE
myROCS <-roc(bank.test$ObsY, pred.glm.rose)
myROCS
##
## Call:
## roc.default(response = bank.test$ObsY, predictor = pred.glm.rose)
##
## Data: pred.glm.rose in 19961 controls (bank.test$ObsY 0) < 2644 cases (bank.test$ObsY 1).
## Area under the curve: 0.908
plot(myROCS)
AUC remains same of 90.8% when compared to the earlier model
# look at TPR (Sensitivity) and TNR (Specificity) distribution over threshold for Original data
matplot(data.frame(myROCS$sensitivities, myROCS$specificities), x = myROCS$thresholds, type='l', xlab = 'threshold', ylab='TPR, TNR')
legend('bottomright', legend=c('TPR', 'TNR'), lty=1:2, col=1:2)
The optimum threshold is 0.5 since that data is perfectly balanced. Customers with predicted probabilities greater than 0.5 will subscribe to the term deposit.
DECISION TREE FOR STANDARDIZED BALANCED DATA Since this is a classification problem, this can also be addressed through decision tree. Here I took the standardized and balanced dataset (ROSE) to create decision tree. Decision tree is created based on Recursive Partitioning algorithm (rpart).
# Decision Tree for standardised data with ROSE Balancing
library(rpart)
tree.rose <- rpart(y ~., data = data_rose)
tree.rose
## n= 22606
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 22606 11234 no (0.5030523 0.4969477)
## 2) duration_imp< 299.0133 12482 3815 no (0.6943599 0.3056401)
## 4) poutcome_level=rest 11273 2729 no (0.7579171 0.2420829)
## 8) month_level=Ja_Ma_Ju_Jul_Au_No 8705 1251 no (0.8562895 0.1437105) *
## 9) month_level=Mar_Se_Oc_De,Fe_Ap 2568 1090 yes (0.4244548 0.5755452)
## 18) housing=yes 972 353 no (0.6368313 0.3631687) *
## 19) housing=no 1596 471 yes (0.2951128 0.7048872) *
## 5) poutcome_level=success 1209 123 yes (0.1017370 0.8982630) *
## 3) duration_imp>=299.0133 10124 2705 yes (0.2671869 0.7328131)
## 6) duration_imp< 511.7324 4562 1812 yes (0.3971942 0.6028058)
## 12) contact=unknown 709 147 no (0.7926657 0.2073343) *
## 13) contact=cellular,telephone 3853 1250 yes (0.3244225 0.6755775) *
## 7) duration_imp>=511.7324 5562 893 yes (0.1605538 0.8394462) *
library(rpart.plot)
rpart.plot(tree.rose, nn = TRUE)
As we can infer from the above, the duration of the call is the root node with probability of not subscribing to the term deposit is 50%. If the duration of the call is greater than 299 seconds, the probability of positive outcome is 73% (45% of the total population). Within this, if the duration if greater than 512 seconds, the probability of positive outcome is 84% (25% of the overall population). If the duration of the call is less than 512 seconds, the probability of positive outcome is 60% (20% of the total population). Within this, if the contact mode is cellular or telephone (not unknown), the probability of positive outcome is 68% (17% of the population). If the duration of the call is less than 299 seconds and if the probability of previous outcome is success, the probability of positive outcome is 90% (5% of the population). If the duration if less than 299 seconds and if the previous outcome is failure or unknown and if the month of contact is Feb, Apr, May, Sep, Oct, Dec, the probability of positive outcome is 58% (11% of the population). Within this, if the person is not having housing loan, then the probability of positive outcome is 70% (7% of the population). This decision tree probabilities from train data is used to predict the test data.
#Decision Tree Prediction
pred.tree.rose <- predict(tree.rose, newdata = bank.test)
pred.tree.rose
#Confusion Matrix
library(SDMTools)
confusion.matrix(bank.test$ObsY, pred.tree.rose[,2])
## obs
## pred 0 1
## 0 15393 396
## 1 4568 2248
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, pred.tree.rose[,2])
The confusion matrix for decision tree shows that True positive (and Sensitivity) is increased while the True negative (and Specificity) is decreased. Accuracy is also reduced to 81% when compared to Logistic Regression. Cohen’s Kappa is also lesser when compared to Logistic Regression.
#ROC Curve
library(pROC)
myROCD <- roc(bank.test$y, pred.tree.rose[,2])
myROCD
##
## Call:
## roc.default(response = bank.test$y, predictor = pred.tree.rose[, 2])
##
## Data: pred.tree.rose[, 2] in 19961 controls (bank.test$y no) < 2644 cases (bank.test$y yes).
## Area under the curve: 0.8448
plot(myROCD)
AUC has been decreased a lot to 84.48% when compared to the Logistic Regression.
# look at TPR (Sensitivity) and TNR (Specificity) distribution over threshold
matplot(data.frame(myROCD$sensitivities, myROCD$specificities), x = myROCD$thresholds, type='l', xlab = 'threshold', ylab='TPR, TNR')
legend('bottomright', legend=c('TPR', 'TNR'), lty=1:2, col=1:2)
confusion.matrix(bank.test$ObsY, pred.tree.rose[,2], threshold = 0.55)
## obs
## pred 0 1
## 0 15393 396
## 1 4568 2248
## attr(,"class")
## [1] "confusion.matrix"
accuracy(bank.test$ObsY, pred.tree.rose[,2],0.55)
The optimum threshold is found to be 0.55. The specificity still remains less than 80%.
INFERENCES ON THE MODEL: Per the extensive data analysis done using logistic regression models (with structured data, with original data, with structured and balanced data) and decision tree model (with structured and balanced data), it is inferred that the best model among these in doing the prediction is Logistic Regression with Structured and Balanced Data. The Accuracy of the model is 83% with Sensitivity of 82.79%, Specificity of 83.37% and with threshold of 0.5. Cohen’s Kappa of 0.45 shows that the model has moderate fit. AUC is good at 90.8%