What’s in this dataset? Let’s start with a quick structure evaluation.
#str(bank.orig)
Since we’ll be doing logistic regression, we need to convert y from “yes”/“no” to 1/0. We’ll also be using glm() rather than lm().
bank.orig<-bank.orig %>% mutate(y_binary = ifelse(y == "no", 0, 1))
bank.orig$y_binary<-as.integer(bank.orig$y_binary) #For GLM regression purposes.
#I convert the "chr" into "factors" here. If this causes problems we can comment these out later.
bank.orig$job<-as.factor(bank.orig$job)
bank.orig$marital<-as.factor(bank.orig$marital)
bank.orig$education<-as.factor(bank.orig$education)
bank.orig$default<-as.factor(bank.orig$default)
bank.orig$housing<-as.factor(bank.orig$housing)
bank.orig$loan<-as.factor(bank.orig$loan)
bank.orig$contact<-as.factor(bank.orig$contact)
bank.orig$month<-as.factor(bank.orig$month)
bank.orig$day_of_week<-as.factor(bank.orig$day_of_week)
bank.orig$poutcome<-as.factor(bank.orig$poutcome)
bank.orig$y<-as.factor(bank.orig$y)
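For reference, dplyr (already loaded above for mutate()) can do the same character-to-factor conversion in one pass. The call below is just a sketch of that alternative, assuming every remaining character column should become a factor; the column-by-column version above does the same thing.
#Alternative (equivalent to the column-by-column calls above): convert every
#character column to a factor in a single mutate().
bank.orig <- bank.orig %>% mutate(across(where(is.character), as.factor))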
OK, there we go. The output variable y answers the question: has the client subscribed to a term deposit? (yes = 1, no = 0). The input variables break down into client attributes (age, job, marital, education, default, housing, loan), last-contact attributes (contact, month, day_of_week, duration), “other” campaign attributes (campaign, pdays, previous, poutcome), and “social / economic” context attributes (emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed).
How many rows and columns does our dataset have?
ncol(bank.orig)
## [1] 22
22 columns, which includes the new y_binary column we just created.
How many rows?
nrow(bank.orig)
## [1] 4119
Quick information on the data.
summary(bank.orig)
## age job marital education
## Min. :18.00 admin. :1012 divorced: 446 university.degree :1264
## 1st Qu.:32.00 blue-collar: 884 married :2509 high.school : 921
## Median :38.00 technician : 691 single :1153 basic.9y : 574
## Mean :40.11 services : 393 unknown : 11 professional.course: 535
## 3rd Qu.:47.00 management : 324 basic.4y : 429
## Max. :88.00 retired : 166 basic.6y : 228
## (Other) : 649 (Other) : 168
## default housing loan contact month
## no :3315 no :1839 no :3349 cellular :2652 may :1378
## unknown: 803 unknown: 105 unknown: 105 telephone:1467 jul : 711
## yes : 1 yes :2175 yes : 665 aug : 636
## jun : 530
## nov : 446
## apr : 215
## (Other): 203
## day_of_week duration campaign pdays previous
## fri:768 Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.0000
## mon:855 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000
## thu:860 Median : 181.0 Median : 2.000 Median :999.0 Median :0.0000
## tue:841 Mean : 256.8 Mean : 2.537 Mean :960.4 Mean :0.1903
## wed:795 3rd Qu.: 317.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000
## Max. :3643.0 Max. :35.000 Max. :999.0 Max. :6.0000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 454 Min. :-3.40000 Min. :92.20 Min. :-50.8
## nonexistent:3523 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## success : 142 Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08497 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
##
## euribor3m nr.employed y y_binary
## Min. :0.635 Min. :4964 no :3668 Min. :0.0000
## 1st Qu.:1.334 1st Qu.:5099 yes: 451 1st Qu.:0.0000
## Median :4.857 Median :5191 Median :0.0000
## Mean :3.621 Mean :5166 Mean :0.1095
## 3rd Qu.:4.961 3rd Qu.:5228 3rd Qu.:0.0000
## Max. :5.045 Max. :5228 Max. :1.0000
##
Let’s take a closer look at y and at pdays (those values make it look like the vast majority of clients were never contacted previously).
hist(bank.orig$pdays)
summary(bank.orig$y)
## no yes
## 3668 451
table(bank.orig$pdays)
##
## 0 1 2 3 4 5 6 7 9 10 11 12 13 14 15 16
## 2 3 4 52 14 4 42 10 3 8 1 5 2 1 2 2
## 17 18 19 21 999
## 1 2 1 1 3959
Out of 4119 observations, 3959 have pdays = 999 (people who were never contacted previously).
#This is the % of observations classified as 999 for pdays. It is the vast majority.
sum(bank.orig$pdays == 999) / nrow(bank.orig)
## [1] 0.9611556
The pdays variable has me thinking that what matters here isn’t so much how many days passed since the client was last contacted, but whether the client was re-contacted AT ALL. So I’m creating a pdays_binary variable to indicate whether a client was recontacted after a previous campaign: 0 if the client hadn’t been contacted, and 1 if they had.
bank.orig<-bank.orig %>% mutate(pdays_binary = ifelse(pdays == 999, 0, 1))
bank.orig$pdays_binary<-as.integer(bank.orig$pdays_binary)
And what about y_binary? For glm() with a binomial family, the response needs to be coded as 0/1 (or as a two-level factor), which is exactly what y_binary gives us.
table(bank.orig$y_binary)
##
## 0 1
## 3668 451
This works out to about 11% of clients having subscribed to a term deposit.
#CrossTable is something I saw on Youtube. It's part of the "gmodels" library, and is pretty awesome.
CrossTable(bank.orig$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4119
##
##
## | no | yes |
## |-----------|-----------|
## | 3668 | 451 |
## | 0.891 | 0.109 |
## |-----------|-----------|
##
##
##
##
Is there any missing data we have to deal with?
colSums(is.na(bank.orig)) #Is there anything marked NA?
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary pdays_binary
## 0 0 0
#No NA fields.
Are there any blank entries?
colSums(bank.orig == " ") #Check for blank (single-space) entries. All clear here.
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary pdays_binary
## 0 0 0
I did notice some “unknown” entries in the summaries above. What about those?
colSums(bank.orig == "unknown")
## age job marital education default
## 0 39 11 167 803
## housing loan contact month day_of_week
## 105 105 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary pdays_binary
## 0 0 0
There are a total of 1230 “unknown” entries in the dataset (39 + 11 + 167 + 803 + 105 + 105). Let’s count how many observations contain at least one of them.
rows_with_unknown <- c()
for (i in seq(nrow(bank.orig))) {
  if (any(bank.orig[i, ] == "unknown")) {
    rows_with_unknown <- append(rows_with_unknown, i)
  }
}
length(rows_with_unknown)
## [1] 1029
There are 1029 observations (about 25% of the dataset) with at least one piece of missing data, which means some observations have multiple “unknown” entries.
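For reference, the same count can be computed without an explicit loop; the one-liner below is a vectorized sketch of the equivalent base-R calculation (output not shown).
#Vectorized equivalent of the loop above: count rows with at least one "unknown".
sum(rowSums(bank.orig == "unknown") > 0)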
The columns with missing data are Job, Marital, Education, Default, Housing and Loan. Do we need these variables? Let’s look at each one to see whether it’s worth keeping in the dataset.
table(bank.orig$job)
##
## admin. blue-collar entrepreneur housemaid management
## 1012 884 148 110 324
## retired self-employed services student technician
## 166 159 393 82 691
## unemployed unknown
## 111 39
There are 39 “unknown” values here.
CrossTable(bank.orig$job, bank.orig$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4119
##
##
## | bank.orig$y
## bank.orig$job | no | yes | Row Total |
## --------------|-----------|-----------|-----------|
## admin. | 879 | 133 | 1012 |
## | 0.547 | 4.445 | |
## | 0.869 | 0.131 | 0.246 |
## | 0.240 | 0.295 | |
## | 0.213 | 0.032 | |
## --------------|-----------|-----------|-----------|
## blue-collar | 823 | 61 | 884 |
## | 1.627 | 13.235 | |
## | 0.931 | 0.069 | 0.215 |
## | 0.224 | 0.135 | |
## | 0.200 | 0.015 | |
## --------------|-----------|-----------|-----------|
## entrepreneur | 140 | 8 | 148 |
## | 0.511 | 4.154 | |
## | 0.946 | 0.054 | 0.036 |
## | 0.038 | 0.018 | |
## | 0.034 | 0.002 | |
## --------------|-----------|-----------|-----------|
## housemaid | 99 | 11 | 110 |
## | 0.011 | 0.091 | |
## | 0.900 | 0.100 | 0.027 |
## | 0.027 | 0.024 | |
## | 0.024 | 0.003 | |
## --------------|-----------|-----------|-----------|
## management | 294 | 30 | 324 |
## | 0.104 | 0.845 | |
## | 0.907 | 0.093 | 0.079 |
## | 0.080 | 0.067 | |
## | 0.071 | 0.007 | |
## --------------|-----------|-----------|-----------|
## retired | 128 | 38 | 166 |
## | 2.659 | 21.622 | |
## | 0.771 | 0.229 | 0.040 |
## | 0.035 | 0.084 | |
## | 0.031 | 0.009 | |
## --------------|-----------|-----------|-----------|
## self-employed | 146 | 13 | 159 |
## | 0.137 | 1.117 | |
## | 0.918 | 0.082 | 0.039 |
## | 0.040 | 0.029 | |
## | 0.035 | 0.003 | |
## --------------|-----------|-----------|-----------|
## services | 358 | 35 | 393 |
## | 0.184 | 1.499 | |
## | 0.911 | 0.089 | 0.095 |
## | 0.098 | 0.078 | |
## | 0.087 | 0.008 | |
## --------------|-----------|-----------|-----------|
## student | 63 | 19 | 82 |
## | 1.375 | 11.186 | |
## | 0.768 | 0.232 | 0.020 |
## | 0.017 | 0.042 | |
## | 0.015 | 0.005 | |
## --------------|-----------|-----------|-----------|
## technician | 611 | 80 | 691 |
## | 0.031 | 0.249 | |
## | 0.884 | 0.116 | 0.168 |
## | 0.167 | 0.177 | |
## | 0.148 | 0.019 | |
## --------------|-----------|-----------|-----------|
## unemployed | 92 | 19 | 111 |
## | 0.474 | 3.857 | |
## | 0.829 | 0.171 | 0.027 |
## | 0.025 | 0.042 | |
## | 0.022 | 0.005 | |
## --------------|-----------|-----------|-----------|
## unknown | 35 | 4 | 39 |
## | 0.002 | 0.017 | |
## | 0.897 | 0.103 | 0.009 |
## | 0.010 | 0.009 | |
## | 0.008 | 0.001 | |
## --------------|-----------|-----------|-----------|
## Column Total | 3668 | 451 | 4119 |
## | 0.891 | 0.109 | |
## --------------|-----------|-----------|-----------|
##
##
The highest y = yes rates come from students (23.2%) and retirees (22.9%). The unknowns account for only 39 observations, with a 10.3% conversion rate, so I recommend dropping those rows.
I create a duplicate of the original bank dataset, just in case we want to do something else with these unknowns.
duplicate.bank<-bank.orig
Unknowns in job are now filtered out of the duplicate dataset.
duplicate.bank<-duplicate.bank %>% filter(job!="unknown")
table(duplicate.bank$marital)
##
## divorced married single unknown
## 444 2481 1144 11
There are only 11 unknown observations here.
CrossTable(duplicate.bank$marital, duplicate.bank$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4080
##
##
## | duplicate.bank$y
## duplicate.bank$marital | no | yes | Row Total |
## -----------------------|-----------|-----------|-----------|
## divorced | 401 | 43 | 444 |
## | 0.081 | 0.655 | |
## | 0.903 | 0.097 | 0.109 |
## | 0.110 | 0.096 | |
## | 0.098 | 0.011 | |
## -----------------------|-----------|-----------|-----------|
## married | 2233 | 248 | 2481 |
## | 0.257 | 2.087 | |
## | 0.900 | 0.100 | 0.608 |
## | 0.615 | 0.555 | |
## | 0.547 | 0.061 | |
## -----------------------|-----------|-----------|-----------|
## single | 989 | 155 | 1144 |
## | 0.864 | 7.021 | |
## | 0.865 | 0.135 | 0.280 |
## | 0.272 | 0.347 | |
## | 0.242 | 0.038 | |
## -----------------------|-----------|-----------|-----------|
## unknown | 10 | 1 | 11 |
## | 0.004 | 0.035 | |
## | 0.909 | 0.091 | 0.003 |
## | 0.003 | 0.002 | |
## | 0.002 | 0.000 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 3633 | 447 | 4080 |
## | 0.890 | 0.110 | |
## -----------------------|-----------|-----------|-----------|
##
##
None of the marital categories really breaks out when it comes to y = yes conversion: the highest is “single” at 13.5%, followed by “married” at 10%. There are only 11 observations in the “unknown” category, so I think it is safe to remove them from the dataset.
duplicate.bank<-duplicate.bank %>% filter(marital!="unknown")
table(duplicate.bank$education)
##
## basic.4y basic.6y basic.9y high.school
## 418 227 568 912
## illiterate professional.course university.degree unknown
## 1 533 1256 154
There were 167 unknowns in the original data; 154 remain after the job and marital filters.
CrossTable(duplicate.bank$education, duplicate.bank$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4069
##
##
## | duplicate.bank$y
## duplicate.bank$education | no | yes | Row Total |
## -------------------------|-----------|-----------|-----------|
## basic.4y | 381 | 37 | 418 |
## | 0.209 | 1.697 | |
## | 0.911 | 0.089 | 0.103 |
## | 0.105 | 0.083 | |
## | 0.094 | 0.009 | |
## -------------------------|-----------|-----------|-----------|
## basic.6y | 210 | 17 | 227 |
## | 0.307 | 2.496 | |
## | 0.925 | 0.075 | 0.056 |
## | 0.058 | 0.038 | |
## | 0.052 | 0.004 | |
## -------------------------|-----------|-----------|-----------|
## basic.9y | 526 | 42 | 568 |
## | 0.811 | 6.592 | |
## | 0.926 | 0.074 | 0.140 |
## | 0.145 | 0.094 | |
## | 0.129 | 0.010 | |
## -------------------------|-----------|-----------|-----------|
## high.school | 815 | 97 | 912 |
## | 0.011 | 0.088 | |
## | 0.894 | 0.106 | 0.224 |
## | 0.225 | 0.217 | |
## | 0.200 | 0.024 | |
## -------------------------|-----------|-----------|-----------|
## illiterate | 1 | 0 | 1 |
## | 0.013 | 0.110 | |
## | 1.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | |
## | 0.000 | 0.000 | |
## -------------------------|-----------|-----------|-----------|
## professional.course | 468 | 65 | 533 |
## | 0.091 | 0.741 | |
## | 0.878 | 0.122 | 0.131 |
## | 0.129 | 0.146 | |
## | 0.115 | 0.016 | |
## -------------------------|-----------|-----------|-----------|
## university.degree | 1092 | 164 | 1256 |
## | 0.620 | 5.036 | |
## | 0.869 | 0.131 | 0.309 |
## | 0.301 | 0.368 | |
## | 0.268 | 0.040 | |
## -------------------------|-----------|-----------|-----------|
## unknown | 130 | 24 | 154 |
## | 0.370 | 3.003 | |
## | 0.844 | 0.156 | 0.038 |
## | 0.036 | 0.054 | |
## | 0.032 | 0.006 | |
## -------------------------|-----------|-----------|-----------|
## Column Total | 3623 | 446 | 4069 |
## | 0.890 | 0.110 | |
## -------------------------|-----------|-----------|-----------|
##
##
“Unknown” makes up about 4% of the education column here, but it has the highest conversion rate (15.6%), followed by university.degree (13.1%). We might consider rolling the unknowns into university.degree (a mode impute), since that is also the most common education level.
duplicate.bank$education<-as.character(duplicate.bank$education)
duplicate.bank$education[duplicate.bank$education=="unknown"]<-"university.degree"
duplicate.bank$education<-as.factor(duplicate.bank$education)
Is the client’s credit in default?
table(duplicate.bank$default)
##
## no unknown yes
## 3284 784 1
I don’t think this predictor is useful: only one client is actually in default, and everyone else is either “no” or “unknown.” I recommend ignoring it in the final model.
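For completeness, a quick cross-tab of default against y (a sketch; output not shown) would confirm there is essentially nothing here for a model to learn from.
#Cross-tabulate default against the outcome: a single "yes", everything else "no" or "unknown".
table(duplicate.bank$default, duplicate.bank$y)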
Does client have a housing loan?
table(duplicate.bank$housing)
##
## no unknown yes
## 1816 105 2148
CrossTable(duplicate.bank$housing, duplicate.bank$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4069
##
##
## | duplicate.bank$y
## duplicate.bank$housing | no | yes | Row Total |
## -----------------------|-----------|-----------|-----------|
## no | 1615 | 201 | 1816 |
## | 0.002 | 0.019 | |
## | 0.889 | 0.111 | 0.446 |
## | 0.446 | 0.451 | |
## | 0.397 | 0.049 | |
## -----------------------|-----------|-----------|-----------|
## unknown | 96 | 9 | 105 |
## | 0.067 | 0.547 | |
## | 0.914 | 0.086 | 0.026 |
## | 0.026 | 0.020 | |
## | 0.024 | 0.002 | |
## -----------------------|-----------|-----------|-----------|
## yes | 1912 | 236 | 2148 |
## | 0.000 | 0.001 | |
## | 0.890 | 0.110 | 0.528 |
## | 0.528 | 0.529 | |
## | 0.470 | 0.058 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 3623 | 446 | 4069 |
## | 0.890 | 0.110 | |
## -----------------------|-----------|-----------|-----------|
##
##
Subscription rates look nearly identical for clients with and without a housing loan (subscribers split roughly 50-50 between the two groups). We can probably ignore this variable, but first: is there a statistically significant difference in subscription between those with and without the loan? The IDRE chart says to use a chi-square test.
chisq.test(duplicate.bank$housing, duplicate.bank$y)
##
## Pearson's Chi-squared test
##
## data: duplicate.bank$housing and duplicate.bank$y
## X-squared = 0.63723, df = 2, p-value = 0.7272
The p-value indicates we fail to reject the null hypothesis of no difference: clients with housing loans were just as likely to subscribe as those without. I recommend we IGNORE this variable in the final model construction.
table(duplicate.bank$loan)
##
## no unknown yes
## 3308 105 656
CrossTable(duplicate.bank$loan, duplicate.bank$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 4069
##
##
## | duplicate.bank$y
## duplicate.bank$loan | no | yes | Row Total |
## --------------------|-----------|-----------|-----------|
## no | 2939 | 369 | 3308 |
## | 0.014 | 0.113 | |
## | 0.888 | 0.112 | 0.813 |
## | 0.811 | 0.827 | |
## | 0.722 | 0.091 | |
## --------------------|-----------|-----------|-----------|
## unknown | 96 | 9 | 105 |
## | 0.067 | 0.547 | |
## | 0.914 | 0.086 | 0.026 |
## | 0.026 | 0.020 | |
## | 0.024 | 0.002 | |
## --------------------|-----------|-----------|-----------|
## yes | 588 | 68 | 656 |
## | 0.026 | 0.212 | |
## | 0.896 | 0.104 | 0.161 |
## | 0.162 | 0.152 | |
## | 0.145 | 0.017 | |
## --------------------|-----------|-----------|-----------|
## Column Total | 3623 | 446 | 4069 |
## | 0.890 | 0.110 | |
## --------------------|-----------|-----------|-----------|
##
##
This looks similar to the housing variable: 11.2% of clients without a personal loan subscribed, versus 10.4% of those with one. Let’s check for a difference with a chi-square test. If the test fails to reject the null hypothesis of no difference, we IGNORE this variable in the final model construction.
chisq.test(duplicate.bank$loan, duplicate.bank$y)
##
## Pearson's Chi-squared test
##
## data: duplicate.bank$loan and duplicate.bank$y
## X-squared = 0.97968, df = 2, p-value = 0.6127
The p-value of the chi-square test indicates we fail to reject the null hypothesis of no difference in subscription between clients with and without personal loans. Based on this, I intend to IGNORE this variable in the final model construction.
There were six variables that had unknown observations: Job, Marital, Education, Default, Housing and Loan. We chose to remove the unknown observations from Job and Marital, and mode-imputed the unknown observations into “university degree” for Education. We chose to not use Default, Housing and Loan variables, because Default did not produce any useful information, and there was no statistically significant difference in subscription behavior for Housing and Loan.
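As a quick sanity check (a sketch; output not shown), we can confirm that no “unknown” values remain in the three columns we cleaned; default, housing and loan still contain unknowns but will be excluded from the model formula.
#Should return 0 for job, marital and education after the filtering and imputation above.
colSums(duplicate.bank[, c("job", "marital", "education")] == "unknown")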
#Bin "previous" (number of contacts before this campaign) into a factor:
#0 contacts, exactly 1 contact, or 2+ contacts.
duplicate.bank$previous_factor[duplicate.bank$previous == 0] <- 0
duplicate.bank$previous_factor[duplicate.bank$previous > 0 & duplicate.bank$previous <= 1] <- 1
duplicate.bank$previous_factor[duplicate.bank$previous > 1] <- "+2"
duplicate.bank$previous_factor <- as.factor(duplicate.bank$previous_factor)
#Stepwise (both directions) logistic regression selected by AIC, starting from
#the intercept-only model; default, duration, housing and loan are excluded.
model.null <- glm(y_binary ~ 1, data = duplicate.bank, family = binomial)
model.full <- glm(y_binary ~ . - y - default - duration - housing - loan,
                  data = duplicate.bank, family = binomial)
step.model <- step(model.null, scope = list(upper = model.full),
                   direction = "both", test = "Chisq", trace = FALSE)
summary(step.model)
##
## Call:
## glm(formula = y_binary ~ nr.employed + poutcome + month + contact +
## cons.conf.idx + campaign + previous + age, family = binomial,
## data = duplicate.bank)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0580 -0.3825 -0.3313 -0.2549 2.7343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 51.1813707 4.7461890 10.784 < 2e-16 ***
## nr.employed -0.0101605 0.0009369 -10.845 < 2e-16 ***
## poutcomenonexistent 0.6940532 0.2631400 2.638 0.008350 **
## poutcomesuccess 1.8136196 0.2495317 7.268 3.65e-13 ***
## monthaug -0.3815923 0.2970226 -1.285 0.198889
## monthdec 0.4503722 0.5443947 0.827 0.408073
## monthjul 0.1129072 0.2677612 0.422 0.673265
## monthjun 0.7346358 0.2622836 2.801 0.005096 **
## monthmar 1.3663943 0.3792065 3.603 0.000314 ***
## monthmay -0.5089205 0.2286437 -2.226 0.026026 *
## monthnov -0.4893375 0.2783077 -1.758 0.078703 .
## monthoct -0.3576916 0.3699739 -0.967 0.333643
## monthsep -0.6187388 0.3784585 -1.635 0.102072
## contacttelephone -0.7727179 0.1785241 -4.328 1.50e-05 ***
## cons.conf.idx 0.0393056 0.0137006 2.869 0.004119 **
## campaign -0.0761678 0.0342871 -2.221 0.026319 *
## previous 0.2296628 0.1471938 1.560 0.118695
## age 0.0077028 0.0048589 1.585 0.112902
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2813.3 on 4068 degrees of freedom
## Residual deviance: 2197.5 on 4051 degrees of freedom
## AIC: 2233.5
##
## Number of Fisher Scoring iterations: 6
Checking goodness of fit with the Hosmer-Lemeshow test.
hoslem.test(step.model$y, fitted(step.model), g = 10) #hoslem.test() comes from the ResourceSelection package.
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: step.model$y, fitted(step.model)
## X-squared = 4.2907, df = 8, p-value = 0.83
The null hypothesis of the Hosmer-Lemeshow test is that the model is adequate and fits the data well; the alternative is that it does not. Here the p-value indicates we fail to reject the null hypothesis, so there is no evidence of lack of fit for our stepwise logistic model.
Let us now turn to diagnostic plots and check for influential points.
plot(step.model, which = 4)
R flags three points in the Cook’s distance plot, but their Cook’s distance values are small, so we can leave them as-is.
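If we want the actual values behind those flagged points, base R’s cooks.distance() will list them; the line below is a sketch (output not shown).
#The three largest Cook's distance values from the fitted stepwise model.
head(sort(cooks.distance(step.model), decreasing = TRUE), 3)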
Let us now examine the pseudo-R^2 of the model to see how well the model explains the variation in outcome.
pseudo.r2 <- PseudoR2(step.model, which = c("McFadden", "Nagel", "CoxSnell"))
round(pseudo.r2, 3)
## McFadden Nagelkerke CoxSnell
## 0.219 0.281 0.140
Pseudo-R^2 for this model ranges from 14% (Cox-Snell) to 28% (Nagelkerke), with McFadden at 22%.
Because this model was developed using AIC as the selection criterion, the AIC score is what matters here: 2233.5. And because selection was driven by AIC rather than a p-value threshold, several variables retained in the model are not statistically significant. It’s all about how low the AIC can go, not the p-values.
The quarterly indicator of the number of employees (nr.employed) is statistically significant, with a p-value well below the typical 0.05. For every unit increase in this indicator, the odds of a successful subscription are multiplied by e^-0.0102 ≈ 0.99, i.e., they decrease by about 1%.
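Rather than exponentiating coefficients by hand, we can compute the odds ratios (and their profile-likelihood confidence intervals) directly; the line below is a sketch of that calculation (output not shown, and confint() can take a few seconds on a GLM).
#Odds ratios with 95% confidence intervals for the stepwise model.
round(exp(cbind(OR = coef(step.model), confint(step.model))), 3)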
The outcome of the previous marketing campaign (poutcome) is also significant. When the previous campaign was a success (poutcomesuccess), the odds of a successful subscription are e^1.814 ≈ 6.1 times the odds for the baseline failure outcome; when there was no previous campaign (poutcomenonexistent), they are e^0.694 ≈ 2.0 times the failure baseline. So clients whose previous campaign succeeded were the most likely to subscribe, followed by clients who had never been part of a campaign.
The month of last contact is not statistically different from the baseline for most months, but March, May and June are significant. The odds of a successful subscription when the client was last contacted in March are e^1.366 ≈ 3.92 times those of the reference month (April, the baseline factor level); for June they are e^0.735 ≈ 2.08 times the reference, while for May they are e^-0.509 ≈ 0.60 times the reference. It is worth noting that no client in the data was last contacted during January or February.
The contact method (telephone vs. cellular) is statistically significant: the odds of a successful subscription when contact is by telephone are e^-0.773 ≈ 0.46 times those for cellular contact. Marketing is better off contacting people by cellular, not land-line.
The monthly consumer confidence index (cons.conf.idx) is statistically significant: the odds of a successful subscription increase by a factor of e^0.039 ≈ 1.04 for every unit increase in the index. The higher the consumer confidence index, the happier Marketing should be.
The number of contacts performed during the campaign (campaign) is statistically significant: the odds of a successful subscription decrease by a factor of e^-0.076 ≈ 0.93 for every additional contact. This makes intuitive sense; as Marketing knows, “Never talk past the sale.” If a person is constantly being contacted (harassed) with offers from the same campaign, eventually they will turn away from the sale.
Let us examine the predictive power of the model we’ve created for the banking client. Since we do not have a held-out test set, we will evaluate the GLM against the original data. First we generate predicted probabilities with predict.glm() and store them in a Predictions column of the original banking dataset. We then convert these probabilities into binary predictions using a cut-off of 0.5.
bank.orig$Predictions <- predict.glm(step.model, newdata = bank.orig, type = "response")
bank.orig$Predictions_binary <- ifelse(bank.orig$Predictions >= 0.5, 1, 0)
The caret package’s confusionMatrix() function gives us the model performance metrics.
#caret::confusionMatrix(as.factor(duplicate.bank$y_binary), as.factor(duplicate.bank$Predictions_binary), positive = "1")
caret::confusionMatrix(as.factor(bank.orig$Predictions_binary), as.factor(bank.orig$y_binary), positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3622 348
## 1 46 103
##
## Accuracy : 0.9043
## 95% CI : (0.895, 0.9132)
## No Information Rate : 0.8905
## P-Value [Acc > NIR] : 0.00207
##
## Kappa : 0.3056
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.22838
## Specificity : 0.98746
## Pos Pred Value : 0.69128
## Neg Pred Value : 0.91234
## Prevalence : 0.10949
## Detection Rate : 0.02501
## Detection Prevalence : 0.03617
## Balanced Accuracy : 0.60792
##
## 'Positive' Class : 1
##
Our GLM, with a cut-off of 0.5, has an overall predictive accuracy of approximately 90.4%, meaning 90.4% of the model’s predictions were correct. That figure is not very revealing here, though, since the bulk of campaign results ended with no subscription. The precision of our model, the ratio of correctly classified “yes” responses to all predicted “yes” responses, is 69.1% (reported as “Pos Pred Value” in the confusion matrix). The recall, reported as “Sensitivity,” is 22.8%; recall is the proportion of actual “yes” values that the model correctly predicts. The model also has a very high specificity of 98.7%. Ideally we would like both sensitivity and specificity to be high, so we might explore adjusting the cut-off threshold to improve these scores, as shown in the script below.
library(ROCR)
#ROC Curve and AUC
pr = predict(step.model, bank.orig, type = "response") # make prediction with last built model
response = bank.orig$y_binary
pred <- prediction(pr, response) #Predicted Probability and True Classification
# area under curve
auc <- round(as.numeric(performance(pred, measure = "auc")@y.values),3)
# some important statistics
false.rates <-performance(pred, "fpr","fnr")
accuracy <-performance(pred, "acc","err")
perf <- performance(pred, "tpr","fpr")
#plotting the ROC curve and computing AUC
plot(perf,colorize = T, main = "ROC Curve")
text(0.5,0.5, paste("AUC:", auc))
# computing threshold for cutoff to best trade off sensitivity and specificity
#first sensitivity
plot(unlist(performance(pred, "sens")@x.values), unlist(performance(pred, "sens")@y.values),
type="l", lwd=2,
ylab="Sensitivity", xlab="Cutoff", main = paste("Maximized Cutoff\n","AUC: ",auc))
par(new=TRUE) # plot another line in same plot
#second specificity
plot(unlist(performance(pred, "spec")@x.values), unlist(performance(pred, "spec")@y.values),
type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2)) #specificity axis labels
mtext("Specificity",side=4, col='red')
#find where the lines intersect
min.diff <-which.min(abs(unlist(performance(pred, "sens")@y.values) - unlist(performance(pred, "spec")@y.values)))
min.x<-unlist(performance(pred, "sens")@x.values)[min.diff]
min.y<-unlist(performance(pred, "spec")@y.values)[min.diff]
optimal <-min.x #this is the optimal points to best trade off sensitivity and specificity
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x,0,paste("optimal threshold=",round(optimal,2)), pos = 4)
pr_class = ifelse(pr>round(optimal,2), 1,0) #use the optimal cutoff to classify
caret::confusionMatrix(as.factor(pr_class),as.factor(response), positive = "1") #this function and package auto computes a lot of the metrics
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2631 131
## 1 1037 320
##
## Accuracy : 0.7164
## 95% CI : (0.7024, 0.7302)
## No Information Rate : 0.8905
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2269
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.70953
## Specificity : 0.71728
## Pos Pred Value : 0.23581
## Neg Pred Value : 0.95257
## Prevalence : 0.10949
## Detection Rate : 0.07769
## Detection Prevalence : 0.32945
## Balanced Accuracy : 0.71341
##
## 'Positive' Class : 1
##
When we choose the cut-off that balances specificity and sensitivity, overall predictive accuracy drops to approximately 71.6%. By lowering the cut-off to about 0.07, we bring specificity (the true negative rate) to 71.7% and sensitivity (the true positive rate) to 70.9%.
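If we wanted to compare other cut-offs beyond 0.5 and the balanced threshold, a small helper makes it easy to recompute the confusion matrix at any value. The function below, eval_cutoff, is a hypothetical sketch (not part of the original analysis); it reuses the pr and response objects created above and the caret package (output not shown).
#Hypothetical helper: confusion matrix for an arbitrary probability cut-off.
eval_cutoff <- function(cutoff) {
  caret::confusionMatrix(as.factor(ifelse(pr > cutoff, 1, 0)),
                         as.factor(response), positive = "1")
}
eval_cutoff(0.5)   #original threshold
eval_cutoff(0.07)  #sensitivity/specificity-balanced threshold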