The motivation of this study is to create a model regarding weather an individual will take the credit for a home repair tax credit in Emil City. The assignment is completed from the prospective of Emil City’s Department of Housing and Community Development. Typically only 11% of the homeowners take the credit. Recently, the department created a new marketing campaign where they now project that 25% of the homeowners take the credit. A sample of records were given, which were then feature engineered in an attempt to create a more accurate model. The costs of marketing is 2850 dollars per person. However, when someone takes the credit the department makes 5000 dollars. The community also receives a benefit from the credit. Surrounding homes gain a 56000 dollar aggregate premium, while houses that take the credit sell with a 10000 dollar premium. This particular study creates a model to determine weather a specific individual will take the credit or not. A cost analysis of the model will also be performed, analyzing the model’s performance.
options(scipen=10000000)
library(tidyverse)
library(kableExtra)
library(caret)
library(knitr)
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
library(ggplot2)
library(sf)
library(ggcorrplot)
palette5 <- c("#FFFF00","#FF0000","#FFCC00","#00CCFF")
palette4 <- c("#981FAC","#FF006A","#FE4C35","#FE9900")
palette2 <- c("#FFCC33","#3333FF")
bank <- data.frame(st_read("C:/Users/Kyle McCarthy/Documents/MUSA_PubPolicy/Churn/data.csv"))%>%
mutate(churnNumeric = as.factor(ifelse(y == "yes", 1, 0))) %>%
rename(Churn = y)%>%
mutate(Churn = ifelse(Churn == "yes", "Yes", "No"))
bank<- bank%>%
na.omit()
bank$age <- as.numeric(as.character(bank$age))
bank$campaign <- as.numeric(as.character(bank$campaign))
bank$duration<- as.numeric(as.character(bank$duration))
bank$pdays <- as.numeric(as.character(bank$pdays))
bank$emp.var.rate <- as.numeric(as.character(bank$emp.var.rate))
bank$cons.price.idx <- as.numeric(as.character(bank$cons.price.idx))
bank$cons.conf.idx <- as.numeric(as.character(bank$cons.conf.idx))
bank$euribor3m <- as.numeric(as.character(bank$euribor3m))
bank$nr.employed <- as.numeric(as.character(bank$nr.employed))
bank$duration <- as.numeric(as.character(bank$duration))
bank$previous <- as.numeric(as.character(bank$previous))
bank %>%
dplyr::select(Churn,age, campaign, pdays, cons.price.idx, previous,
nr.employed, euribor3m, cons.conf.idx, emp.var.rate, duration)%>%
gather(Variable, value, -Churn) %>%
ggplot() +
geom_bar(position = "dodge", stat = "summary", fun = "mean",
aes(Churn, value, fill=Churn), show.legend = FALSE) +
facet_wrap(~Variable, scales = "free") +
scale_fill_manual(values = palette2) +
labs(x="Churn", y="Mean",
title = "Feature Associations with the Likelihood of Churn",
subtitle = "(continous outcomes)") +
theme_classic()
bank %>%
dplyr::select(Churn,age, campaign, pdays, cons.price.idx, previous,
nr.employed, euribor3m, cons.conf.idx, emp.var.rate, duration) %>%
gather(Variable, value, -Churn) %>%
ggplot() +
geom_density(aes(value, color=Churn), fill = "transparent") +
facet_wrap(~Variable, scales = "free") +
scale_fill_manual(values = palette2) +
labs(title = "Feature distributions Take Credit vs. Not Take Credit",
subtitle = "(Continous Outcomes)") +
theme_classic()
bank %>%
dplyr::select(Churn, job, marital, education, housing, loan, contact,
poutcome, day_of_week, month) %>%
gather(Variable, value, -Churn) %>%
count(Variable, value, Churn) %>%
ggplot(., aes(value, n, fill = Churn)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values = palette2) +
labs(x="Click", y="Value",
title = "Feature Associations with Likelihood of Taking the Credit",
subtitle = "Categorical features") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
These were the variables that were feature engineeredand used the models. There were many other variables that were feature engineered, but the results did not fit the model. Many variables were eliminated from the kitchen sink approach to avoid overfitting to improve the specificity value associated with the kitchen sink model shown below.
Features used in the Feature engineering include:
If the emp.var.rate was greater than .25
If nr.employed is less than 5088 while duration was greater than 13
If duration is greater than 10 while the emp.var.rate is less than -2.5
Selected months
If euribor4m is less than 0.73 and duration is greater or eqal to 5
If cons.conf.idx is in a specific range while duration greater than 12
Many other features were engineered, but they showed either no impact on the model, or decreased the model’s accuracy.
bank1 <- bank
bank1 <- mutate(bank1, nr.employed1 = ifelse(nr.employed < 5088, 1, 0))
bank1 <- mutate(bank1, emp.var.rate1= ifelse(emp.var.rate >= -2.5, 1, 0))
bank1 <- mutate(bank1, employ_d = ifelse(duration > 13 & nr.employed < 5088, 1, 0))
bank1 <- mutate(bank1, var_rate_d = ifelse(duration > 10 & emp.var.rate < -2.5, 1, 0))
bank1 <- mutate(bank1, month1 = ifelse(month != 'may' & month != 'april'& month != 'dec' & month != 'oct', 1, 2))
bank1 <- mutate(bank1, euribor_d = ifelse(euribor3m < 0.73 & duration >= 5, 1, 2))
bank1 <- mutate(bank1, cons.conf_d = ifelse(((cons.conf.idx >-43.5 & cons.conf.idx < -42.5) | (cons.conf.idx >-38.5 & cons.conf.idx < -36.5)) & duration > 12, "good", "bad"))
bank1 <- mutate(bank1, campaign1 = ifelse(campaign > 3.5,1, 0))
bank1 <- mutate(bank1, job1 = ifelse((job == "admin." | job == "blue-collar") & (duration >= 15), 0, 1))
bank1 <- mutate(bank1, loan.housing = ifelse(loan == "yes" & housing == "yes" & nr.employed > 5088, 1, 0))
In logistic regression, the generalized linear model (glm) predicts the probablity of an observation such as churn or no churn. In this case we are predicting the probability that someone takes the credit (churn), or does not take the credit (no churn). Logistic Regression is based off of maxium liklihood estimation. Logistic regression fits an S-shaped curve. The predicted probabilities are run on a continutous scale from 0 to 100. By splitting the data up into training groups, (65/35), the model predicts the probability of churning or not churning on new data.
# No feature engineering
set.seed(2021)
trainIndex <- createDataPartition(bank$Churn, p = .65,
list = FALSE,
times = 1)
churnTrain <- bank[trainIndex,]
churnTest <- bank[-trainIndex,]
churnreg <- glm(churnNumeric ~ .,
data=churnTrain %>% dplyr::select(churnNumeric,
emp.var.rate, euribor3m,
poutcome, duration,
poutcome, job, age, marital, education, loan, contact, month, day_of_week, duration, campaign, pdays, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed),
family="binomial" (link="logit"))
summary(churnreg)
##
## Call:
## glm(formula = churnNumeric ~ ., family = binomial(link = "logit"),
## data = churnTrain %>% dplyr::select(churnNumeric, emp.var.rate,
## euribor3m, poutcome, duration, poutcome, job, age, marital,
## education, loan, contact, month, day_of_week, duration,
## campaign, pdays, poutcome, emp.var.rate, cons.price.idx,
## cons.conf.idx, euribor3m, nr.employed))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8799 -0.2902 -0.1779 -0.1121 2.8846
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -169.1403216 149.7861606 -1.129
## emp.var.rate -1.0316964 0.5606150 -1.840
## euribor3m -0.0750228 0.5046292 -0.149
## poutcomenonexistent 0.5170649 0.2677907 1.931
## poutcomesuccess 0.9506521 0.7268259 1.308
## duration 0.0049253 0.0003229 15.255
## jobblue-collar -0.0594121 0.3356408 -0.177
## jobentrepreneur -0.7032684 0.6113842 -1.150
## jobhousemaid 0.4566215 0.5366936 0.851
## jobmanagement -0.0005860 0.3420963 -0.002
## jobretired -0.1188436 0.4238817 -0.280
## jobself-employed -0.7109119 0.5943402 -1.196
## jobservices 0.2299972 0.3629005 0.634
## jobstudent 0.1855712 0.4695863 0.395
## jobtechnician 0.4469018 0.2688729 1.662
## jobunemployed 0.2746721 0.5188685 0.529
## jobunknown 0.2796204 0.8493150 0.329
## age 0.0194539 0.0101407 1.918
## maritalmarried 0.2456663 0.2981532 0.824
## maritalsingle 0.4441035 0.3400248 1.306
## maritalunknown -12.2721455 579.2129206 -0.021
## educationbasic.6y 0.3955129 0.4981984 0.794
## educationbasic.9y -0.1428717 0.4132811 -0.346
## educationhigh.school 0.1508787 0.3769949 0.400
## educationilliterate -13.3845633 1455.3976844 -0.009
## educationprofessional.course -0.1023334 0.4126173 -0.248
## educationuniversity.degree 0.2507544 0.3798069 0.660
## educationunknown 0.3172213 0.4853152 0.654
## loanunknown -0.5450535 0.6264772 -0.870
## loanyes 0.0902652 0.2181773 0.414
## contacttelephone -0.8446281 0.3473231 -2.432
## monthaug 0.6065789 0.5045071 1.202
## monthdec 0.8918619 0.7879639 1.132
## monthjul -0.0753617 0.4513295 -0.167
## monthjun 0.6859769 0.5367836 1.278
## monthmar 2.3087201 0.6165788 3.744
## monthmay -0.2689469 0.3748905 -0.717
## monthnov -0.2220857 0.5115717 -0.434
## monthoct 0.6099208 0.6454712 0.945
## monthsep 0.5168296 0.7282198 0.710
## day_of_weekmon 0.0247056 0.2667384 0.093
## day_of_weekthu 0.1779404 0.2633603 0.676
## day_of_weektue -0.0656022 0.2647024 -0.248
## day_of_weekwed 0.2757072 0.2750566 1.002
## campaign -0.1283403 0.0595887 -2.154
## pdays -0.0007640 0.0007171 -1.065
## cons.price.idx 1.6325325 0.9872520 1.654
## cons.conf.idx 0.0623301 0.0328321 1.898
## nr.employed 0.0027433 0.0121071 0.227
## Pr(>|z|)
## (Intercept) 0.258808
## emp.var.rate 0.065725 .
## euribor3m 0.881815
## poutcomenonexistent 0.053501 .
## poutcomesuccess 0.190890
## duration < 0.0000000000000002 ***
## jobblue-collar 0.859500
## jobentrepreneur 0.250025
## jobhousemaid 0.394878
## jobmanagement 0.998633
## jobretired 0.779194
## jobself-employed 0.231643
## jobservices 0.526228
## jobstudent 0.692710
## jobtechnician 0.096487 .
## jobunemployed 0.596551
## jobunknown 0.741981
## age 0.055060 .
## maritalmarried 0.409962
## maritalsingle 0.191521
## maritalunknown 0.983096
## educationbasic.6y 0.427262
## educationbasic.9y 0.729567
## educationhigh.school 0.688999
## educationilliterate 0.992662
## educationprofessional.course 0.804126
## educationuniversity.degree 0.509116
## educationunknown 0.513344
## loanunknown 0.384284
## loanyes 0.679076
## contacttelephone 0.015023 *
## monthaug 0.229240
## monthdec 0.257695
## monthjul 0.867388
## monthjun 0.201271
## monthmar 0.000181 ***
## monthmay 0.473126
## monthnov 0.664198
## monthoct 0.344698
## monthsep 0.477880
## day_of_weekmon 0.926204
## day_of_weekthu 0.499261
## day_of_weektue 0.804263
## day_of_weekwed 0.316167
## campaign 0.031258 *
## pdays 0.286720
## cons.price.idx 0.098206 .
## cons.conf.idx 0.057636 .
## nr.employed 0.820747
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1853.7 on 2678 degrees of freedom
## Residual deviance: 1057.3 on 2630 degrees of freedom
## AIC: 1155.3
##
## Number of Fisher Scoring iterations: 14
# Feature Engineering
set.seed(2021)
trainIndex <- createDataPartition(bank1$Churn, p = .65,
list = FALSE,
times = 1)
churnTrain1 <- bank1[trainIndex,]
churnTest1 <- bank1[-trainIndex,]
churnreg1 <- glm(churnNumeric ~ .,
data=churnTrain1 %>% dplyr::select(churnNumeric, emp.var.rate1, employ_d, var_rate_d, euribor_d, cons.conf_d, month1, duration, nr.employed1, euribor3m, poutcome, age),
family="binomial" (link="logit"))
summary(churnreg1)
##
## Call:
## glm(formula = churnNumeric ~ ., family = binomial(link = "logit"),
## data = churnTrain1 %>% dplyr::select(churnNumeric, emp.var.rate1,
## employ_d, var_rate_d, euribor_d, cons.conf_d, month1,
## duration, nr.employed1, euribor3m, poutcome, age))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4373 -0.2973 -0.1903 -0.1462 3.0886
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8602109 0.9020867 -2.062 0.03920 *
## emp.var.rate1 0.1875677 0.2769393 0.677 0.49822
## employ_d 12.9434461 358.8992649 0.036 0.97123
## var_rate_d NA NA NA NA
## euribor_d -0.8210564 0.2900274 -2.831 0.00464 **
## cons.conf_dgood 0.1119244 0.2486430 0.450 0.65261
## month1 -0.4716921 0.1993886 -2.366 0.01800 *
## duration 0.0046593 0.0003055 15.253 < 0.0000000000000002 ***
## nr.employed1 -11.1900207 358.8992393 -0.031 0.97513
## euribor3m -0.3799690 0.0704656 -5.392 0.0000000696 ***
## poutcomenonexistent 0.4194479 0.2449788 1.712 0.08686 .
## poutcomesuccess 1.4711669 0.3198722 4.599 0.0000042405 ***
## age 0.0096445 0.0065876 1.464 0.14318
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1853.7 on 2678 degrees of freedom
## Residual deviance: 1106.1 on 2667 degrees of freedom
## AIC: 1130.1
##
## Number of Fisher Scoring iterations: 13
The odds ratio is a measure of exposure and outcome. The value represents the value the odds a outcome will occur compared to the odds an outcome will occur given the incident exposure.
# No Feature Engineering
co1 <- churnreg$coefficients
odd_ratios1 <- exp(churnreg$coefficients)
kable(co1, caption = "Regreesion Coeficients for Model Variables - Kitchen Sink") %>%
kable_classic(font_size = 12)
| x | |
|---|---|
| (Intercept) | -169.1403216 |
| emp.var.rate | -1.0316964 |
| euribor3m | -0.0750228 |
| poutcomenonexistent | 0.5170649 |
| poutcomesuccess | 0.9506521 |
| duration | 0.0049253 |
| jobblue-collar | -0.0594121 |
| jobentrepreneur | -0.7032684 |
| jobhousemaid | 0.4566215 |
| jobmanagement | -0.0005860 |
| jobretired | -0.1188436 |
| jobself-employed | -0.7109119 |
| jobservices | 0.2299972 |
| jobstudent | 0.1855712 |
| jobtechnician | 0.4469018 |
| jobunemployed | 0.2746721 |
| jobunknown | 0.2796204 |
| age | 0.0194539 |
| maritalmarried | 0.2456663 |
| maritalsingle | 0.4441035 |
| maritalunknown | -12.2721455 |
| educationbasic.6y | 0.3955129 |
| educationbasic.9y | -0.1428717 |
| educationhigh.school | 0.1508787 |
| educationilliterate | -13.3845633 |
| educationprofessional.course | -0.1023334 |
| educationuniversity.degree | 0.2507544 |
| educationunknown | 0.3172213 |
| loanunknown | -0.5450535 |
| loanyes | 0.0902652 |
| contacttelephone | -0.8446281 |
| monthaug | 0.6065789 |
| monthdec | 0.8918619 |
| monthjul | -0.0753617 |
| monthjun | 0.6859769 |
| monthmar | 2.3087201 |
| monthmay | -0.2689469 |
| monthnov | -0.2220857 |
| monthoct | 0.6099208 |
| monthsep | 0.5168296 |
| day_of_weekmon | 0.0247056 |
| day_of_weekthu | 0.1779404 |
| day_of_weektue | -0.0656022 |
| day_of_weekwed | 0.2757072 |
| campaign | -0.1283403 |
| pdays | -0.0007640 |
| cons.price.idx | 1.6325325 |
| cons.conf.idx | 0.0623301 |
| nr.employed | 0.0027433 |
kable(odd_ratios1, caption = "Odd Ratios(x) for Model Variables - Kitchen Sink") %>%
kable_classic(font_size = 12)
| x | |
|---|---|
| (Intercept) | 0.0000000 |
| emp.var.rate | 0.3564018 |
| euribor3m | 0.9277224 |
| poutcomenonexistent | 1.6770979 |
| poutcomesuccess | 2.5873964 |
| duration | 1.0049375 |
| jobblue-collar | 0.9423183 |
| jobentrepreneur | 0.4949649 |
| jobhousemaid | 1.5787312 |
| jobmanagement | 0.9994142 |
| jobretired | 0.8879466 |
| jobself-employed | 0.4911961 |
| jobservices | 1.2585965 |
| jobstudent | 1.2039060 |
| jobtechnician | 1.5634608 |
| jobunemployed | 1.3160990 |
| jobunknown | 1.3226277 |
| age | 1.0196444 |
| maritalmarried | 1.2784729 |
| maritalsingle | 1.5590919 |
| maritalunknown | 0.0000047 |
| educationbasic.6y | 1.4851457 |
| educationbasic.9y | 0.8668653 |
| educationhigh.school | 1.1628556 |
| educationilliterate | 0.0000015 |
| educationprofessional.course | 0.9027285 |
| educationuniversity.degree | 1.2849944 |
| educationunknown | 1.3733064 |
| loanunknown | 0.5798108 |
| loanyes | 1.0944645 |
| contacttelephone | 0.4297171 |
| monthaug | 1.8341458 |
| monthdec | 2.4396678 |
| monthjul | 0.9274080 |
| monthjun | 1.9857107 |
| monthmar | 10.0615387 |
| monthmay | 0.7641838 |
| monthnov | 0.8008467 |
| monthoct | 1.8402857 |
| monthsep | 1.6767034 |
| day_of_weekmon | 1.0250133 |
| day_of_weekthu | 1.1947541 |
| day_of_weektue | 0.9365034 |
| day_of_weekwed | 1.3174620 |
| campaign | 0.8795540 |
| pdays | 0.9992363 |
| cons.price.idx | 5.1168168 |
| cons.conf.idx | 1.0643136 |
| nr.employed | 1.0027470 |
# Feature Engineering
co2 <- churnreg1$coefficients
odds_ratios <- exp(churnreg1$coefficients)
kable(co2, caption = "Regression Coeficients for Model Variables - Kitchen Sink") %>%
kable_classic(font_size = 12)
| x | |
|---|---|
| (Intercept) | -1.8602109 |
| emp.var.rate1 | 0.1875677 |
| employ_d | 12.9434461 |
| var_rate_d | NA |
| euribor_d | -0.8210564 |
| cons.conf_dgood | 0.1119244 |
| month1 | -0.4716921 |
| duration | 0.0046593 |
| nr.employed1 | -11.1900207 |
| euribor3m | -0.3799690 |
| poutcomenonexistent | 0.4194479 |
| poutcomesuccess | 1.4711669 |
| age | 0.0096445 |
kable(odds_ratios, caption = "Odd Ratios(x) for Model Variables- Feature Engineering") %>%
kable_classic(font_size = 12)
| x | |
|---|---|
| (Intercept) | 0.1556398 |
| emp.var.rate1 | 1.2063120 |
| employ_d | 418087.5303426 |
| var_rate_d | NA |
| euribor_d | 0.4399666 |
| cons.conf_dgood | 1.1184283 |
| month1 | 0.6239456 |
| duration | 1.0046701 |
| nr.employed1 | 0.0000138 |
| euribor3m | 0.6838826 |
| poutcomenonexistent | 1.5211214 |
| poutcomesuccess | 4.3543134 |
| age | 1.0096912 |
In Logistic Regression The McFadden R2 value is similar to the R2 value in OLS regression. However, the R2 value cannot be interpreted the same way as the R2 in OLS regression. However, generally speaking, a McFadden value of closer to 1 is typically a better model. Note that the kitchen sink has a higher McFadden values because it incorporates many more variables, not because it is a better model.
LogisticReg1 <- pR2(churnreg)[4]
## fitting null model for pseudo-r2
kable(LogisticReg1, caption = "McFadden Pseudo R2 value Kitchen Sink") %>%
kable_classic_2(font_size = 12)
| x | |
|---|---|
| McFadden | 0.4296426 |
LogisticReg <- pR2(churnreg1)[4]
## fitting null model for pseudo-r2
kable(LogisticReg, caption = "McFadden Pseudo R2 value Feature Engineering") %>%
kable_classic_2(font_size = 12)
| x | |
|---|---|
| McFadden | 0.4033133 |
This section depicts the predicted probabilities for churn or no churn, represented by the values “0” or ‘1’. In a successful model, the ‘0’ graph should cluster close to 0. A clustering closer to 0 on the x-axis indicates a higher probability of predicting a customer does not churn. In contrast, an accurate graph in the ‘yes’ graph should depict a clustering of values closer to 1. In the model relating to weather a customer will take the credit or not, the model predicts well for those who will not take the credit. As shown on the Figure 11, the ‘0’ graph is clustered towards 0 on the x axis. Unfortunately, the model does not preform as well in the bottom ‘Yes’ graph since the distribution of predicted probabilities are dispersed rather than clustered. Feature engineering helped improve the graph, creating a greater density of probabilities closer to 1. However, the graph indicates a large source of error in the model in predicting for people who will take the credit. Note that there is a slight difference between the kitchen sink model and the feature engineered model. The feature engineered model depicts a slightly more accurate model since the values in the no churn graph are more clustered towards 0.
# No Feature Engineering
testProbs <- data.frame(Outcome = as.factor(churnTest$churnNumeric),
Probs = predict(churnreg, churnTest, type= "response"))
head(testProbs)
## Outcome Probs
## 2 0 0.02177699
## 3 0 0.04610825
## 5 0 0.01166591
## 6 0 0.21664051
## 7 0 0.38187866
## 14 0 0.03401087
ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
scale_fill_manual(values = palette2) + xlim(0, 1) +
labs(x = "Churn", y = "Density of Probabilities",
title = "Distribution of Predicted Probabilities by Observed Outcome -Kitchen Sink ") +
theme(strip.text.x = element_text(size = 1),
legend.position = "none")
# Feature Engineering
testProbs1 <- data.frame(Outcome = as.factor(churnTest1$churnNumeric),
Probs = predict(churnreg1, churnTest1, type= "response"))
head(testProbs)
## Outcome Probs
## 2 0 0.02177699
## 3 0 0.04610825
## 5 0 0.01166591
## 6 0 0.21664051
## 7 0 0.38187866
## 14 0 0.03401087
ggplot(testProbs1, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
scale_fill_manual(values = palette2) + xlim(0, 1) +
labs(x = "Churn", y = "Density of Probabilities",
title = "Distribution of Predicted Probabilities by Observed Outcome - Feature Engineering ") +
theme(strip.text.x = element_text(size = 1),
legend.position = "none")
The ROC curve visualizes trade offs for while also providing a goodness of fit indicator. Following the y-axis to 0.75, and looking at the orange curve, the ROC curve indicates that the model that predicts churn correctly 75 percent of the time, incorrectly predicts churn aproximently 8 percent of the time. As the threshold increases, the rate of false positives also increases. If the ROC is below the 50 percent line the fit is not a useful fit. If the ROC is a right angle, the model is over-fit. In this particular model, the ROC curve depicts a relatively useful fit. The area under the curve is slightly larger for the feature engineered model. The area under the curve for the Kitchen Sink model was .9313, while the area underneath the curve for the feature engineered model was .9391.
# Kitchen Sink
testProbs <-
testProbs %>%
mutate(predOutcome = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1247 79
## 1 36 78
##
## Accuracy : 0.9201
## 95% CI : (0.9049, 0.9336)
## No Information Rate : 0.891
## P-Value [Acc > NIR] : 0.0001305
##
## Kappa : 0.5328
##
## Mcnemar's Test P-Value : 8.984e-05
##
## Sensitivity : 0.49682
## Specificity : 0.97194
## Pos Pred Value : 0.68421
## Neg Pred Value : 0.94042
## Prevalence : 0.10903
## Detection Rate : 0.05417
## Detection Prevalence : 0.07917
## Balanced Accuracy : 0.73438
##
## 'Positive' Class : 1
##
auc(testProbs$Outcome, testProbs$Probs)
## Area under the curve: 0.9313
ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve - Credit Model - Kitchen Sink")
# Feature Engineering
testProbs1 <-
testProbs1 %>%
mutate(predOutcome = as.factor(ifelse(testProbs1$Probs > 0.5 , 1, 0)))
caret::confusionMatrix(testProbs1$predOutcome, testProbs1$Outcome,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1250 80
## 1 33 77
##
## Accuracy : 0.9215
## 95% CI : (0.9064, 0.9349)
## No Information Rate : 0.891
## P-Value [Acc > NIR] : 6.238e-05
##
## Kappa : 0.535
##
## Mcnemar's Test P-Value : 1.509e-05
##
## Sensitivity : 0.49045
## Specificity : 0.97428
## Pos Pred Value : 0.70000
## Neg Pred Value : 0.93985
## Prevalence : 0.10903
## Detection Rate : 0.05347
## Detection Prevalence : 0.07639
## Balanced Accuracy : 0.73236
##
## 'Positive' Class : 1
##
auc(testProbs1$Outcome, testProbs1$Probs)
## Area under the curve: 0.9379
ggplot(testProbs1, aes(d = as.numeric(testProbs1$Outcome), m = Probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve - Credit Model - Feature Engineering")
Cross validation is used to divide the data into k number of folds. In this model, the trainControl parameter is set to run 100 - kfolds and output predicted probabilities. Additional parameters output the AUC and the confusion metrics for each fold. The AUC represents the area underneath the ROC curve. Note that this is represented as ‘ROC’ in Figures 14 and 15. The cvFit output are for mean AUC, sensitivity and specificity across 100 folds. The sensitivity graph in figure 15 indicates the rate that the model correctly predicts true positives. The model predicts true positives very well as the data is clustered near the value of 1 on the x axis. In contrast the model does not predict as well regarding specificity. Specificity indicates the rate the model predicts true negatives. Note that this is not representative of a very tight distribution, indicating that the model is not generlizable. Consequently, the model is inconsistent in how it predicts churn. The goal of the feature engineering was to increase the specificity value. The feature engineering increased the specificity value from .415 to .466.
# No Feature Engineering
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
cvFit <- train(Churn ~ .,
data=bank %>%
dplyr::select(Churn,emp.var.rate, euribor3m,
poutcome, duration,
poutcome, job, age, marital, education, loan, contact, month, day_of_week, duration, campaign, pdays, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFit
## Generalized Linear Model
##
## 4119 samples
## 17 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 4079, 4078, 4078, 4078, 4077, 4078, ...
## Resampling results:
##
## ROC Sens Spec
## 0.9295398 0.9713739 0.4215
dplyr::select(cvFit$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#FFCC66") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#FF0000", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines- Kitchen Sink")
# Feature Engineering
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
cvFit1 <- train(Churn ~ .,
data=bank1 %>%
dplyr::select(Churn, emp.var.rate1, employ_d, var_rate_d, euribor_d, cons.conf_d, month1, duration, nr.employed1, euribor3m, poutcome, age),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFit1
## Generalized Linear Model
##
## 4119 samples
## 11 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 4078, 4078, 4077, 4078, 4079, 4078, ...
## Resampling results:
##
## ROC Sens Spec
## 0.9304767 0.9735811 0.449
dplyr::select(cvFit1$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit1$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#FFCC66") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#FF0000", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines")
Figure 9 depicts the confusion matrix. Each row in the confusion matrix is calculates the cost benefit from the feature engineered model. The equations for each equation are as follows:
True Negative = count * 5000
True Positive = (5000 - 2800) * (Count * .25) + ((-2850) * count * .75))
False Negative = (-5000 * Count)
False Positive = (5000 - 2850) * Count
Note that the 10,000 dollar premium received by the home and the $56,000 aggregate premium from the surrounding houses was excluded from this equation. The table is based on the prospective from the Department of Housing and Community Development. The homeowner, not the Department of Housing and Community Development, receives the benefit from the 10,000 dollar premium. The surrounding houses receives the benefit from the 56,000 home value premium. Although there may be additional benefits to the Department of Housing and Community Development, the calculations below are based on the known benefits received from the projected 25 percent of homeowners who participated in the Community Development program credit.
cost_benefit_table <-
testProbs1 %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
True_Positive = sum(n[predOutcome==1 & Outcome==1]),
False_Negative = sum(n[predOutcome==0 & Outcome==1]),
False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Revenue =
case_when(Variable == "True_Negative" ~ Count * 5000,
Variable == "True_Positive" ~ ((5000 - 2800) * (Count * .25)) + ((-2850) * (Count * .75)),
Variable == "False_Negative" ~ (-5000) * Count,
Variable == "False_Positive" ~ (5000 - 2850) * Count)) %>%
bind_cols(data.frame(Description = c(
"We predicted no churn and did not spend resources",
"We predicted churn and spent resources",
"We predicted no churn and the customer churned",
"We predicted churn and the customer did not churn")))
kable(cost_benefit_table) %>%
kable_styling(font_size = 12, full_width = F,
bootstrap_options = c("striped", "hover", "condensed"))
| Variable | Count | Revenue | Description |
|---|---|---|---|
| True_Negative | 1250 | 6250000.0 | We predicted no churn and did not spend resources |
| True_Positive | 77 | -122237.5 | We predicted churn and spent resources |
| False_Negative | 80 | -400000.0 | We predicted no churn and the customer churned |
| False_Positive | 33 | 70950.0 | We predicted churn and the customer did not churn |
Figure 16: Cost Benefit Table
#This function takes as its inputs, a data frame with an observed binomial class (1 or 0); a vector of predicted probabilities; and optionally a group indicator like race. It returns accuracy plus counts and rates of confusion matrix outcomes. It’s a bit verbose because of the if (missing(group)).
iterateThresholds <- function(data, observedClass, predictedProbs, group) {
observedClass <- enquo(observedClass)
predictedProbs <- enquo(predictedProbs)
group <- enquo(group)
x = .01
all_prediction <- data.frame()
if (missing(group)) {
while (x <= 1) {
this_prediction <- data.frame()
this_prediction <-
data %>%
mutate(predclass = ifelse(!!predictedProbs > x, 1,0)) %>%
count(predclass, !!observedClass) %>%
summarize(Count_TN = sum(n[predclass==0 & !!observedClass==0]),
Count_TP = sum(n[predclass==1 & !!observedClass==1]),
Count_FN = sum(n[predclass==0 & !!observedClass==1]),
Count_FP = sum(n[predclass==1 & !!observedClass==0]),
Rate_TP = Count_TP / (Count_TP + Count_FN),
Rate_FP = Count_FP / (Count_FP + Count_TN),
Rate_FN = Count_FN / (Count_FN + Count_TP),
Rate_TN = Count_TN / (Count_TN + Count_FP),
Accuracy = (Count_TP + Count_TN) /
(Count_TP + Count_TN + Count_FN + Count_FP)) %>%
mutate(Threshold = round(x,2))
all_prediction <- rbind(all_prediction,this_prediction)
x <- x + .01
}
return(all_prediction)
}
else if (!missing(group)) {
while (x <= 1) {
this_prediction <- data.frame()
this_prediction <-
data %>%
mutate(predclass = ifelse(!!predictedProbs > x, 1,0)) %>%
group_by(!!group) %>%
count(predclass, !!observedClass) %>%
summarize(Count_TN = sum(n[predclass==0 & !!observedClass==0]),
Count_TP = sum(n[predclass==1 & !!observedClass==1]),
Count_FN = sum(n[predclass==0 & !!observedClass==1]),
Count_FP = sum(n[predclass==1 & !!observedClass==0]),
Rate_TP = Count_TP / (Count_TP + Count_FN),
Rate_FP = Count_FP / (Count_FP + Count_TN),
Rate_FN = Count_FN / (Count_FN + Count_TP),
Rate_TN = Count_TN / (Count_TN + Count_FP),
Accuracy = (Count_TP + Count_TN) /
(Count_TP + Count_TN + Count_FN + Count_FP)) %>%
mutate(Threshold = round(x,2))
all_prediction <- rbind(all_prediction,this_prediction)
x <- x + .01
}
return(all_prediction)
}
}
This section utilizes the data collected from the feature engineered
whichThreshold <-
iterateThresholds(
data=testProbs, observedClass = Outcome, predictedProbs = Probs)
whichThreshold <-
whichThreshold %>%
dplyr::select(starts_with("Count"), Threshold) %>%
gather(Variable, Count, -Threshold) %>%
mutate(Revenue =
case_when(Variable == "Count_TN" ~ Count * 5000,
Variable == "Count_TP" ~ ((5000 - 2800) * (Count * .25)) +
(-2850 * (Count * .50)),
Variable == "Count_FN" ~ (-5000) * Count,
Variable == "Count_FP" ~ (5000 - 2800) * Count))
whichThreshold %>%
ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette5) +
labs(title = "Profit by Confusion Matrix Type and Threshold",
y = "Profit") +
theme_classic() +
guides(colour=guide_legend(title = "Confusion Matrix"))
whichThreshold %>%
ggplot(.,aes(Threshold, Count, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette5) +
labs(title = "Total Count by Confusion Matrix Type and Threshold",
y = "Profit") +
theme_classic() +
guides(colour=guide_legend(title = "Confusion Matrix"))
which_threshold2 <-
whichThreshold %>%
filter(Threshold == 0.35 | Threshold == 0.5)%>%
group_by(Threshold)%>%
summarise(Total_count = sum(Count), Total_Revenue = sum(Revenue))
kable(which_threshold2, caption = "Comparing Optimal Threshold to 50% Threshold")%>%
kable_styling()
| Threshold | Total_count | Total_Revenue |
|---|---|---|
| 0.35 | 1440 | 5863600 |
| 0.50 | 1440 | 5850950 |
Overall, this model should not be put into production because it is not generalizable. Specifically, the model struggles in the specificity category. Specificity refers to the rate that the model predicts no churn. Despite feature engineering, the specificity did not increase much compared to the kitchen sink model. However, the feature engineering model did perform better than the kitchen sink model. Although this particular model did not make as much of a difference in the specificity, feature engineering has the ability to make a significant impact on the model. Consequently, to improve the model feature engineering should be continued to ensure that the best possible variables are created to increase the specificity, resulting in a better model. Another option is to collect more data and different variables to look into other attributes that may make a difference in the model.
On the positive, the model performs very well in the sensitivity category.Sensitivity is the rate that the model correctly predicts churn. In order to increase specificity, the high sensitivity value may have to be sacrificed in order to create the best model possible. Even though the model is not completely generalizable, the model predicts a profit of 5863600 dollars at a 35% threshold. Although this profit is significant, a more generalizable model with a higher specificity would likely return an even higher profit. The community also benefits from every credit bought. The surrounding homes make an aggregate of 56000 dollars from every credit and each house gets a 10000 dollar premium from the credit when they sell their house. Lastly, the ROC graph shows that the model is a relatively good fit, having an area of .9391. However this high ROC risks overfitting the model.
Finally, to ensure a more improved response rate, the department should advertise to the people most likely to take the credit during the best time periods. The should choose times where homeowners are most receptive to increased investment. According to the attributes, a good time to market is in the months of March or January. They should also look focus on categorical characteristics such as employment, education or gender to look for trends among different demographics.The housing department can utilize additional attributes such as lot size, age of house, house material, and location to look for additional trends in the data. Finally, the department can seek endorsements from real estate and construction companies to help promote the credit.