What’s in this dataset? Let’s start with a quick structure evaluation.
#str(bank.orig)
Since we’ll be doing logistic regression, we need to convert y from “yes”/“no” to 1/0. We’ll also be using glm() rather than lm().
bank.orig<-bank.orig %>% mutate(y_binary = ifelse(y == "no", 0, 1))
bank.orig$y_binary<-as.integer(bank.orig$y_binary) #For GLM regression purposes.
#I convert the "chr" into "factors" here. If this causes problems we can comment these out later.
bank.orig$job<-as.factor(bank.orig$job)
bank.orig$marital<-as.factor(bank.orig$marital)
bank.orig$education<-as.factor(bank.orig$education)
bank.orig$default<-as.factor(bank.orig$default)
bank.orig$housing<-as.factor(bank.orig$housing)
bank.orig$loan<-as.factor(bank.orig$loan)
bank.orig$contact<-as.factor(bank.orig$contact)
bank.orig$month<-as.factor(bank.orig$month)
bank.orig$day_of_week<-as.factor(bank.orig$day_of_week)
bank.orig$poutcome<-as.factor(bank.orig$poutcome)
bank.orig$y<-as.factor(bank.orig$y)
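For reference, dplyr (already loaded above for mutate()) can do the same character-to-factor conversion in one pass. The call below is just a sketch of that alternative, assuming every remaining character column should become a factor; the column-by-column version above does the same thing.
#Alternative (equivalent to the column-by-column calls above): convert every
#character column to a factor in a single mutate().
bank.orig <- bank.orig %>% mutate(across(where(is.character), as.factor))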
OK, there we go. The output variable y answers the question: has the client subscribed to a term deposit? (yes = 1, no = 0). The input variables break down into client attributes (age, job, marital, education, default, housing, loan), last-contact attributes (contact, month, day_of_week, duration), “other” campaign attributes (campaign, pdays, previous, poutcome), and “social / economic” context attributes (emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed).
How many rows and columns does our dataset have?
ncol(bank.orig)
## [1] 22
22 columns, which includes the new y_binary column we just created.
How many rows?
nrow(bank.orig)
## [1] 4119
Quick information on the data.
summary(bank.orig)
## age job marital education
## Min. :18.00 admin. :1012 divorced: 446 university.degree :1264
## 1st Qu.:32.00 blue-collar: 884 married :2509 high.school : 921
## Median :38.00 technician : 691 single :1153 basic.9y : 574
## Mean :40.11 services : 393 unknown : 11 professional.course: 535
## 3rd Qu.:47.00 management : 324 basic.4y : 429
## Max. :88.00 retired : 166 basic.6y : 228
## (Other) : 649 (Other) : 168
## default housing loan contact month
## no :3315 no :1839 no :3349 cellular :2652 may :1378
## unknown: 803 unknown: 105 unknown: 105 telephone:1467 jul : 711
## yes : 1 yes :2175 yes : 665 aug : 636
## jun : 530
## nov : 446
## apr : 215
## (Other): 203
## day_of_week duration campaign pdays previous
## fri:768 Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.0000
## mon:855 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000
## thu:860 Median : 181.0 Median : 2.000 Median :999.0 Median :0.0000
## tue:841 Mean : 256.8 Mean : 2.537 Mean :960.4 Mean :0.1903
## wed:795 3rd Qu.: 317.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000
## Max. :3643.0 Max. :35.000 Max. :999.0 Max. :6.0000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 454 Min. :-3.40000 Min. :92.20 Min. :-50.8
## nonexistent:3523 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## success : 142 Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08497 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
##
## euribor3m nr.employed y y_binary
## Min. :0.635 Min. :4964 no :3668 Min. :0.0000
## 1st Qu.:1.334 1st Qu.:5099 yes: 451 1st Qu.:0.0000
## Median :4.857 Median :5191 Median :0.0000
## Mean :3.621 Mean :5166 Mean :0.1095
## 3rd Qu.:4.961 3rd Qu.:5228 3rd Qu.:0.0000
## Max. :5.045 Max. :5228 Max. :1.0000
##
Let’s take a closer look at y and at pdays (those values make it look like the vast majority of clients were never contacted previously).
hist(bank.orig$pdays)
summary(bank.orig$y)
## no yes
## 3668 451
table(bank.orig$pdays)
##
## 0 1 2 3 4 5 6 7 9 10 11 12 13 14 15 16
## 2 3 4 52 14 4 42 10 3 8 1 5 2 1 2 2
## 17 18 19 21 999
## 1 2 1 1 3959
Out of 4119 observations, 3959 have pdays = 999 (people who were never contacted previously).
#This is the % of observations classified as 999 for pdays. It is the vast majority.
sum(bank.orig$pdays == 999) / nrow(bank.orig)
## [1] 0.9611556
The pdays variable has me thinking that what matters here isn’t so much how many days passed since the client was last contacted, but whether the client was re-contacted AT ALL. So I’m creating a pdays_binary variable to indicate whether a client was recontacted after a previous campaign: 0 if the client hadn’t been contacted, and 1 if they had.
bank.orig<-bank.orig %>% mutate(pdays_binary = ifelse(pdays == 999, 0, 1))
bank.orig$pdays_binary<-as.integer(bank.orig$pdays_binary)
And what about y_binary? For glm() with a binomial family, the response needs to be coded as 0/1 (or as a two-level factor), which is exactly what y_binary gives us.
table(bank.orig$y_binary)
##
## 0 1
## 3668 451
This works out to about 11% of clients having subscribed to a term deposit.
#CrossTable is something I saw on Youtube. It's part of the "gmodels" library, and is pretty awesome.
CrossTable(bank.orig$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4119
##
##
## | no | yes |
## |-----------|-----------|
## | 3668 | 451 |
## | 0.891 | 0.109 |
## |-----------|-----------|
##
##
##
##
Is there any missing data we have to deal with?
colSums(is.na(bank.orig)) #Is there anything marked NA?
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary pdays_binary
## 0 0 0
#No NA fields.
Are there any blank entries?
colSums(bank.orig == " ") #Check for blank (single-space) entries. All clear here.
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary pdays_binary
## 0 0 0
I did notice some “unknown” entries in the summaries above. What about those?
colSums(bank.orig == "unknown")
## age job marital education default
## 0 39 11 167 803
## housing loan contact month day_of_week
## 105 105 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary pdays_binary
## 0 0 0
There are a total of 1230 “unknown” entries in the dataset (39 + 11 + 167 + 803 + 105 + 105). Let’s count how many observations contain at least one of them.
rows_with_unknown <- c()
for (i in seq(nrow(bank.orig))) {
  if (any(bank.orig[i, ] == "unknown")) {
    rows_with_unknown <- append(rows_with_unknown, i)
  }
}
length(rows_with_unknown)
## [1] 1029
There are 1029 observations (about 25% of the dataset) with at least one piece of missing data, which means some observations have multiple “unknown” entries.
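For reference, the same count can be computed without an explicit loop; the one-liner below is a vectorized sketch of the equivalent base-R calculation (output not shown).
#Vectorized equivalent of the loop above: count rows with at least one "unknown".
sum(rowSums(bank.orig == "unknown") > 0)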
The columns with missing data are Job, Marital, Education, Default, Housing and Loan. Do we need these variables? Let’s look at each one to see whether it’s worth keeping in the dataset.
table(bank.orig$job)
##
## admin. blue-collar entrepreneur housemaid management
## 1012 884 148 110 324
## retired self-employed services student technician
## 166 159 393 82 691
## unemployed unknown
## 111 39
There are 39 “unknown” values here.
CrossTable(bank.orig$job, bank.orig$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4119
##
##
## | bank.orig$y
## bank.orig$job | no | yes | Row Total |
## --------------|-----------|-----------|-----------|
## admin. | 879 | 133 | 1012 |
## | 0.547 | 4.445 | |
## | 0.869 | 0.131 | 0.246 |
## | 0.240 | 0.295 | |
## | 0.213 | 0.032 | |
## --------------|-----------|-----------|-----------|
## blue-collar | 823 | 61 | 884 |
## | 1.627 | 13.235 | |
## | 0.931 | 0.069 | 0.215 |
## | 0.224 | 0.135 | |
## | 0.200 | 0.015 | |
## --------------|-----------|-----------|-----------|
## entrepreneur | 140 | 8 | 148 |
## | 0.511 | 4.154 | |
## | 0.946 | 0.054 | 0.036 |
## | 0.038 | 0.018 | |
## | 0.034 | 0.002 | |
## --------------|-----------|-----------|-----------|
## housemaid | 99 | 11 | 110 |
## | 0.011 | 0.091 | |
## | 0.900 | 0.100 | 0.027 |
## | 0.027 | 0.024 | |
## | 0.024 | 0.003 | |
## --------------|-----------|-----------|-----------|
## management | 294 | 30 | 324 |
## | 0.104 | 0.845 | |
## | 0.907 | 0.093 | 0.079 |
## | 0.080 | 0.067 | |
## | 0.071 | 0.007 | |
## --------------|-----------|-----------|-----------|
## retired | 128 | 38 | 166 |
## | 2.659 | 21.622 | |
## | 0.771 | 0.229 | 0.040 |
## | 0.035 | 0.084 | |
## | 0.031 | 0.009 | |
## --------------|-----------|-----------|-----------|
## self-employed | 146 | 13 | 159 |
## | 0.137 | 1.117 | |
## | 0.918 | 0.082 | 0.039 |
## | 0.040 | 0.029 | |
## | 0.035 | 0.003 | |
## --------------|-----------|-----------|-----------|
## services | 358 | 35 | 393 |
## | 0.184 | 1.499 | |
## | 0.911 | 0.089 | 0.095 |
## | 0.098 | 0.078 | |
## | 0.087 | 0.008 | |
## --------------|-----------|-----------|-----------|
## student | 63 | 19 | 82 |
## | 1.375 | 11.186 | |
## | 0.768 | 0.232 | 0.020 |
## | 0.017 | 0.042 | |
## | 0.015 | 0.005 | |
## --------------|-----------|-----------|-----------|
## technician | 611 | 80 | 691 |
## | 0.031 | 0.249 | |
## | 0.884 | 0.116 | 0.168 |
## | 0.167 | 0.177 | |
## | 0.148 | 0.019 | |
## --------------|-----------|-----------|-----------|
## unemployed | 92 | 19 | 111 |
## | 0.474 | 3.857 | |
## | 0.829 | 0.171 | 0.027 |
## | 0.025 | 0.042 | |
## | 0.022 | 0.005 | |
## --------------|-----------|-----------|-----------|
## unknown | 35 | 4 | 39 |
## | 0.002 | 0.017 | |
## | 0.897 | 0.103 | 0.009 |
## | 0.010 | 0.009 | |
## | 0.008 | 0.001 | |
## --------------|-----------|-----------|-----------|
## Column Total | 3668 | 451 | 4119 |
## | 0.891 | 0.109 | |
## --------------|-----------|-----------|-----------|
##
##
The highest y = yes rates come from students (23.2%) and retirees (22.9%). The unknowns account for only 39 observations, with a 10.3% conversion rate, so I recommend dropping those rows.
I create a duplicate of the original bank dataset, just in case we want to do something else with these unknowns.
duplicate.bank<-bank.orig
Unknowns in job are now filtered out of the duplicate dataset.
duplicate.bank<-duplicate.bank %>% filter(job!="unknown")
table(duplicate.bank$marital)
##
## divorced married single unknown
## 444 2481 1144 11
There are only 11 unknown observations here.
CrossTable(duplicate.bank$marital, duplicate.bank$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4080
##
##
## | duplicate.bank$y
## duplicate.bank$marital | no | yes | Row Total |
## -----------------------|-----------|-----------|-----------|
## divorced | 401 | 43 | 444 |
## | 0.081 | 0.655 | |
## | 0.903 | 0.097 | 0.109 |
## | 0.110 | 0.096 | |
## | 0.098 | 0.011 | |
## -----------------------|-----------|-----------|-----------|
## married | 2233 | 248 | 2481 |
## | 0.257 | 2.087 | |
## | 0.900 | 0.100 | 0.608 |
## | 0.615 | 0.555 | |
## | 0.547 | 0.061 | |
## -----------------------|-----------|-----------|-----------|
## single | 989 | 155 | 1144 |
## | 0.864 | 7.021 | |
## | 0.865 | 0.135 | 0.280 |
## | 0.272 | 0.347 | |
## | 0.242 | 0.038 | |
## -----------------------|-----------|-----------|-----------|
## unknown | 10 | 1 | 11 |
## | 0.004 | 0.035 | |
## | 0.909 | 0.091 | 0.003 |
## | 0.003 | 0.002 | |
## | 0.002 | 0.000 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 3633 | 447 | 4080 |
## | 0.890 | 0.110 | |
## -----------------------|-----------|-----------|-----------|
##
##
None of the marital categories really breaks out when it comes to y = yes conversion: the highest is “single” at 13.5%, followed by “married” at 10%. There are only 11 observations in the “unknown” category, so I think it is safe to remove them from the dataset.
duplicate.bank<-duplicate.bank %>% filter(marital!="unknown")
table(duplicate.bank$education)
##
## basic.4y basic.6y basic.9y high.school
## 418 227 568 912
## illiterate professional.course university.degree unknown
## 1 533 1256 154
There were 167 unknowns in the original data; 154 remain after the job and marital filters.
CrossTable(duplicate.bank$education, duplicate.bank$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4069
##
##
## | duplicate.bank$y
## duplicate.bank$education | no | yes | Row Total |
## -------------------------|-----------|-----------|-----------|
## basic.4y | 381 | 37 | 418 |
## | 0.209 | 1.697 | |
## | 0.911 | 0.089 | 0.103 |
## | 0.105 | 0.083 | |
## | 0.094 | 0.009 | |
## -------------------------|-----------|-----------|-----------|
## basic.6y | 210 | 17 | 227 |
## | 0.307 | 2.496 | |
## | 0.925 | 0.075 | 0.056 |
## | 0.058 | 0.038 | |
## | 0.052 | 0.004 | |
## -------------------------|-----------|-----------|-----------|
## basic.9y | 526 | 42 | 568 |
## | 0.811 | 6.592 | |
## | 0.926 | 0.074 | 0.140 |
## | 0.145 | 0.094 | |
## | 0.129 | 0.010 | |
## -------------------------|-----------|-----------|-----------|
## high.school | 815 | 97 | 912 |
## | 0.011 | 0.088 | |
## | 0.894 | 0.106 | 0.224 |
## | 0.225 | 0.217 | |
## | 0.200 | 0.024 | |
## -------------------------|-----------|-----------|-----------|
## illiterate | 1 | 0 | 1 |
## | 0.013 | 0.110 | |
## | 1.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | |
## | 0.000 | 0.000 | |
## -------------------------|-----------|-----------|-----------|
## professional.course | 468 | 65 | 533 |
## | 0.091 | 0.741 | |
## | 0.878 | 0.122 | 0.131 |
## | 0.129 | 0.146 | |
## | 0.115 | 0.016 | |
## -------------------------|-----------|-----------|-----------|
## university.degree | 1092 | 164 | 1256 |
## | 0.620 | 5.036 | |
## | 0.869 | 0.131 | 0.309 |
## | 0.301 | 0.368 | |
## | 0.268 | 0.040 | |
## -------------------------|-----------|-----------|-----------|
## unknown | 130 | 24 | 154 |
## | 0.370 | 3.003 | |
## | 0.844 | 0.156 | 0.038 |
## | 0.036 | 0.054 | |
## | 0.032 | 0.006 | |
## -------------------------|-----------|-----------|-----------|
## Column Total | 3623 | 446 | 4069 |
## | 0.890 | 0.110 | |
## -------------------------|-----------|-----------|-----------|
##
##
“Unknown” makes up about 4% of the education column here, but it has the highest conversion rate (15.6%), followed by university.degree (13.1%). We might consider rolling the unknowns into university.degree (a mode impute), since that is also the most common education level.
duplicate.bank$education<-as.character(duplicate.bank$education)
duplicate.bank$education[duplicate.bank$education=="unknown"]<-"university.degree"
duplicate.bank$education<-as.factor(duplicate.bank$education)
Is the client’s credit in default?
table(duplicate.bank$default)
##
## no unknown yes
## 3284 784 1
I don’t think this predictor is useful: only one client is actually in default, and everyone else is either “no” or “unknown.” I recommend ignoring it in the final model.
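For completeness, a quick cross-tab of default against y (a sketch; output not shown) would confirm there is essentially nothing here for a model to learn from.
#Cross-tabulate default against the outcome: a single "yes", everything else "no" or "unknown".
table(duplicate.bank$default, duplicate.bank$y)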
Does client have a housing loan?
table(duplicate.bank$housing)
##
## no unknown yes
## 1816 105 2148
CrossTable(duplicate.bank$housing, duplicate.bank$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4069
##
##
## | duplicate.bank$y
## duplicate.bank$housing | no | yes | Row Total |
## -----------------------|-----------|-----------|-----------|
## no | 1615 | 201 | 1816 |
## | 0.002 | 0.019 | |
## | 0.889 | 0.111 | 0.446 |
## | 0.446 | 0.451 | |
## | 0.397 | 0.049 | |
## -----------------------|-----------|-----------|-----------|
## unknown | 96 | 9 | 105 |
## | 0.067 | 0.547 | |
## | 0.914 | 0.086 | 0.026 |
## | 0.026 | 0.020 | |
## | 0.024 | 0.002 | |
## -----------------------|-----------|-----------|-----------|
## yes | 1912 | 236 | 2148 |
## | 0.000 | 0.001 | |
## | 0.890 | 0.110 | 0.528 |
## | 0.528 | 0.529 | |
## | 0.470 | 0.058 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 3623 | 446 | 4069 |
## | 0.890 | 0.110 | |
## -----------------------|-----------|-----------|-----------|
##
##
Subscription rates look nearly identical for clients with and without a housing loan (subscribers split roughly 50-50 between the two groups). We can probably ignore this variable, but first: is there a statistically significant difference in subscription between those with and without the loan? The IDRE chart says to use a chi-square test.
chisq.test(duplicate.bank$housing, duplicate.bank$y)
##
## Pearson's Chi-squared test
##
## data: duplicate.bank$housing and duplicate.bank$y
## X-squared = 0.63723, df = 2, p-value = 0.7272
The p-value indicates we fail to reject the null hypothesis of no difference: clients with housing loans were just as likely to subscribe as those without. I recommend we IGNORE this variable in the final model construction.
table(duplicate.bank$loan)
##
## no unknown yes
## 3308 105 656
CrossTable(duplicate.bank$loan, duplicate.bank$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4069
##
##
## | duplicate.bank$y
## duplicate.bank$loan | no | yes | Row Total |
## --------------------|-----------|-----------|-----------|
## no | 2939 | 369 | 3308 |
## | 0.014 | 0.113 | |
## | 0.888 | 0.112 | 0.813 |
## | 0.811 | 0.827 | |
## | 0.722 | 0.091 | |
## --------------------|-----------|-----------|-----------|
## unknown | 96 | 9 | 105 |
## | 0.067 | 0.547 | |
## | 0.914 | 0.086 | 0.026 |
## | 0.026 | 0.020 | |
## | 0.024 | 0.002 | |
## --------------------|-----------|-----------|-----------|
## yes | 588 | 68 | 656 |
## | 0.026 | 0.212 | |
## | 0.896 | 0.104 | 0.161 |
## | 0.162 | 0.152 | |
## | 0.145 | 0.017 | |
## --------------------|-----------|-----------|-----------|
## Column Total | 3623 | 446 | 4069 |
## | 0.890 | 0.110 | |
## --------------------|-----------|-----------|-----------|
##
##
This looks similar to the housing variable: 11.2% of clients without a personal loan subscribed, versus 10.4% of those with one. Let’s check for a difference with a chi-square test. If the test fails to reject the null hypothesis of no difference, we IGNORE this variable in the final model construction.
chisq.test(duplicate.bank$loan, duplicate.bank$y)
##
## Pearson's Chi-squared test
##
## data: duplicate.bank$loan and duplicate.bank$y
## X-squared = 0.97968, df = 2, p-value = 0.6127
The p-value of the chi-square test indicates we fail to reject the null hypothesis of no difference in subscription between clients with and without personal loans. Based on this, I intend to IGNORE this variable in the final model construction.
There were six variables that had unknown observations: Job, Marital, Education, Default, Housing and Loan. We chose to remove the unknown observations from Job and Marital, and mode-imputed the unknown observations into “university degree” for Education. We chose to not use Default, Housing and Loan variables, because Default did not produce any useful information, and there was no statistically significant difference in subscription behavior for Housing and Loan.
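As a quick sanity check (a sketch; output not shown), we can confirm that no “unknown” values remain in the three columns we cleaned; default, housing and loan still contain unknowns but will be excluded from the model formula.
#Should return 0 for job, marital and education after the filtering and imputation above.
colSums(duplicate.bank[, c("job", "marital", "education")] == "unknown")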
#Bin "previous" (number of contacts before this campaign) into a factor:
#0 contacts, exactly 1 contact, or 2+ contacts.
duplicate.bank$previous_factor[duplicate.bank$previous == 0] <- 0
duplicate.bank$previous_factor[duplicate.bank$previous > 0 & duplicate.bank$previous <= 1] <- 1
duplicate.bank$previous_factor[duplicate.bank$previous > 1] <- "+2"
duplicate.bank$previous_factor <- as.factor(duplicate.bank$previous_factor)
#Stepwise (both directions) logistic regression selected by AIC, starting from
#the intercept-only model; default, duration, housing and loan are excluded.
model.null <- glm(y_binary ~ 1, data = duplicate.bank, family = binomial)
model.full <- glm(y_binary ~ . - y - default - duration - housing - loan,
                  data = duplicate.bank, family = binomial)
step.model <- step(model.null, scope = list(upper = model.full),
                   direction = "both", test = "Chisq", trace = FALSE)
summary(step.model)
##
## Call:
## glm(formula = y_binary ~ nr.employed + poutcome + month + contact +
## cons.conf.idx + campaign + previous + age, family = binomial,
## data = duplicate.bank)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0580 -0.3825 -0.3313 -0.2549 2.7343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 51.1813707 4.7461890 10.784 < 2e-16 ***
## nr.employed -0.0101605 0.0009369 -10.845 < 2e-16 ***
## poutcomenonexistent 0.6940532 0.2631400 2.638 0.008350 **
## poutcomesuccess 1.8136196 0.2495317 7.268 3.65e-13 ***
## monthaug -0.3815923 0.2970226 -1.285 0.198889
## monthdec 0.4503722 0.5443947 0.827 0.408073
## monthjul 0.1129072 0.2677612 0.422 0.673265
## monthjun 0.7346358 0.2622836 2.801 0.005096 **
## monthmar 1.3663943 0.3792065 3.603 0.000314 ***
## monthmay -0.5089205 0.2286437 -2.226 0.026026 *
## monthnov -0.4893375 0.2783077 -1.758 0.078703 .
## monthoct -0.3576916 0.3699739 -0.967 0.333643
## monthsep -0.6187388 0.3784585 -1.635 0.102072
## contacttelephone -0.7727179 0.1785241 -4.328 1.50e-05 ***
## cons.conf.idx 0.0393056 0.0137006 2.869 0.004119 **
## campaign -0.0761678 0.0342871 -2.221 0.026319 *
## previous 0.2296628 0.1471938 1.560 0.118695
## age 0.0077028 0.0048589 1.585 0.112902
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2813.3 on 4068 degrees of freedom
## Residual deviance: 2197.5 on 4051 degrees of freedom
## AIC: 2233.5
##
## Number of Fisher Scoring iterations: 6
Checking goodness of fit with the Hosmer-Lemeshow test.
hoslem.test(step.model$y, fitted(step.model), g = 10) #hoslem.test() comes from the ResourceSelection package.
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: step.model$y, fitted(step.model)
## X-squared = 4.2907, df = 8, p-value = 0.83
The null hypothesis of the Hosmer-Lemeshow test is that the model is adequate and fits the data well; the alternative is that it does not. Here the p-value indicates we fail to reject the null hypothesis, so there is no evidence of lack of fit for our stepwise logistic model.
Let us now turn to diagnostic plots and check for influential points.
plot(step.model, which = 4)
R flags three points in the Cook’s distance plot, but their Cook’s distance values are small, so we can leave them as-is.
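If we want the actual values behind those flagged points, base R’s cooks.distance() will list them; the line below is a sketch (output not shown).
#The three largest Cook's distance values from the fitted stepwise model.
head(sort(cooks.distance(step.model), decreasing = TRUE), 3)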
Let us now examine the pseudo-R^2 of the model to see how well the model explains the variation in outcome.
pseudo.r2 <- PseudoR2(step.model, which = c("McFadden", "Nagel", "CoxSnell"))
round(pseudo.r2, 3)
## McFadden Nagelkerke CoxSnell
## 0.219 0.281 0.140
Pseudo-R^2 for this model ranges from 14% (Cox-Snell) to 28% (Nagelkerke), with McFadden at 22%.
Because this model was developed using AIC as the selection criterion, the AIC score is what matters here: 2233.5. And because selection was driven by AIC rather than a p-value threshold, several variables retained in the model are not statistically significant. It’s all about how low the AIC can go, not the p-values.
The quarterly indicator of the number of employees (nr.employed) is statistically significant, with a p-value well below the typical 0.05. For every unit increase in this indicator, the odds of a successful subscription are multiplied by e^-0.0102 ≈ 0.99, i.e., they decrease by about 1%.
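Rather than exponentiating coefficients by hand, we can compute the odds ratios (and their profile-likelihood confidence intervals) directly; the line below is a sketch of that calculation (output not shown, and confint() can take a few seconds on a GLM).
#Odds ratios with 95% confidence intervals for the stepwise model.
round(exp(cbind(OR = coef(step.model), confint(step.model))), 3)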
The outcome of the previous marketing campaign (poutcome) is also significant. When the previous campaign was a success (poutcomesuccess), the odds of a successful subscription are e^1.814 ≈ 6.1 times the odds for the baseline failure outcome; when there was no previous campaign (poutcomenonexistent), they are e^0.694 ≈ 2.0 times the failure baseline. So clients whose previous campaign succeeded were the most likely to subscribe, followed by clients who had never been part of a campaign.
The month of last contact is not statistically different from the baseline for most months, but March, May and June are significant. The odds of a successful subscription when the client was last contacted in March are e^1.366 ≈ 3.92 times those of the reference month (April, the baseline factor level); for June they are e^0.735 ≈ 2.08 times the reference, while for May they are e^-0.509 ≈ 0.60 times the reference. It is worth noting that no client in the data was last contacted during January or February.
The contact method (telephone vs. cellular) is statistically significant: the odds of a successful subscription when contact is by telephone are e^-0.773 ≈ 0.46 times those for cellular contact. Marketing is better off contacting people by cellular, not land-line.
The monthly consumer confidence index (cons.conf.idx) is statistically significant: the odds of a successful subscription increase by a factor of e^0.039 ≈ 1.04 for every unit increase in the index. The higher the consumer confidence index, the happier Marketing should be.
The number of contacts performed during the campaign (campaign) is statistically significant: the odds of a successful subscription decrease by a factor of e^-0.076 ≈ 0.93 for every additional contact. This makes intuitive sense; as Marketing knows, “Never talk past the sale.” If a person is constantly being contacted (harassed) with offers from the same campaign, eventually they will turn away from the sale.
Let us examine the predictive power of the model we’ve created for the banking client. Since we do not have a held-out test set, we will evaluate the GLM against the original data. First we generate predicted probabilities with predict.glm() and store them in a Predictions column of the original banking dataset. We then convert these probabilities into binary predictions using a cut-off of 0.5.
bank.orig$Predictions <- predict.glm(step.model, newdata = bank.orig, type = "response")
bank.orig$Predictions_binary <- ifelse(bank.orig$Predictions >= 0.5, 1, 0)
The caret package’s confusionMatrix() function gives us the model performance metrics.
#caret::confusionMatrix(as.factor(duplicate.bank$y_binary), as.factor(duplicate.bank$Predictions_binary), positive = "1")
caret::confusionMatrix(as.factor(bank.orig$Predictions_binary), as.factor(bank.orig$y_binary), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3622 348
## 1 46 103
##
## Accuracy : 0.9043
## 95% CI : (0.895, 0.9132)
## No Information Rate : 0.8905
## P-Value [Acc > NIR] : 0.00207
##
## Kappa : 0.3056
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.22838
## Specificity : 0.98746
## Pos Pred Value : 0.69128
## Neg Pred Value : 0.91234
## Prevalence : 0.10949
## Detection Rate : 0.02501
## Detection Prevalence : 0.03617
## Balanced Accuracy : 0.60792
##
## 'Positive' Class : 1
##
Our GLM, with a cut-off of 0.5, has an overall predictive accuracy of approximately 90.4%, meaning 90.4% of the model’s predictions were correct. That figure is not very revealing here, though, since the bulk of campaign results ended with no subscription. The precision of our model, the ratio of correctly classified “yes” responses to all predicted “yes” responses, is 69.1% (reported as “Pos Pred Value” in the confusion matrix). The recall, reported as “Sensitivity,” is 22.8%; recall is the proportion of actual “yes” values that the model correctly predicts. The model also has a very high specificity of 98.7%. Ideally we would like both sensitivity and specificity to be high, so we might explore adjusting the cut-off threshold to improve these scores, as shown in the script below.
library(ROCR)
#ROC Curve and AUC
pr = predict(step.model, bank.orig, type = "response") # make prediction with last built model
response = bank.orig$y_binary
pred <- prediction(pr, response) #Predicted Probability and True Classification
# area under curve
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
# some important statistics
false.rates <-performance(pred, "fpr","fnr")
accuracy <-performance(pred, "acc","err")
perf <- performance(pred, "tpr","fpr")
#plotting the ROC curve and computing AUC
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
# computing threshold for cutoff to best trade off sensitivity and specificity
#first sensitivity
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE) # plot another line in same plot
#second specificity
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
#find where the lines intersect
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x #this is the optimal points to best trade off sensitivity and specificity
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 4)
pr_class = ifelse(pr>round(optimal,2), 1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(response), positive = "1") #this function and package auto computes a lot of the metrics
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2631 131
## 1 1037 320
##
## Accuracy : 0.7164
## 95% CI : (0.7024, 0.7302)
## No Information Rate : 0.8905
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2269
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.70953
## Specificity : 0.71728
## Pos Pred Value : 0.23581
## Neg Pred Value : 0.95257
## Prevalence : 0.10949
## Detection Rate : 0.07769
## Detection Prevalence : 0.32945
## Balanced Accuracy : 0.71341
##
## 'Positive' Class : 1
##
When we choose the cut-off that balances specificity and sensitivity, overall predictive accuracy drops to approximately 71.6%. By lowering the cut-off to about 0.07, we bring specificity (the true negative rate) to 71.7% and sensitivity (the true positive rate) to 70.9%.
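If we wanted to compare other cut-offs beyond 0.5 and the balanced threshold, a small helper makes it easy to recompute the confusion matrix at any value. The function below, eval_cutoff, is a hypothetical sketch (not part of the original analysis); it reuses the pr and response objects created above and the caret package (output not shown).
#Hypothetical helper: confusion matrix for an arbitrary probability cut-off.
eval_cutoff <- function(cutoff) {
  caret::confusionMatrix(as.factor(ifelse(pr > cutoff, 1, 0)),
                         as.factor(response), positive = "1")
}
eval_cutoff(0.5)   #original threshold
eval_cutoff(0.07)  #sensitivity/specificity-balanced threshold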