Julian Hartwell
Penn, MUSA 508 - Public Policy Analytics
Fall 2020
In this assignment, my job was to build a machine learning classifier that reduces revenue loss for the Philadelphia Department of Housing and Community Development (HCD). My role was to gather client and economic data and build a binary classifier that predicts whether someone will use a housing credit to make repairs to their home.
This city (the scenario is based on a UC Irvine dataset from a Portuguese bank’s marketing campaign) was incentivized to improve the credit voucher acceptance rate, since increased uptake raised property values both immediately and across the surrounding neighborhood. In most scenarios the department lost money; however, this is an ancillary concern, since the program has great societal benefit.
First, I read in my libraries, defined some custom functions, and brought in the housing dataset, which contains 4,119 observations and 27 base features after removing NAs.
knitr::opts_chunk$set(error=FALSE)
options(scipen=10000000)
library(tidyverse)
library(caret)
library(knitr)
library(pscl)
library(plotROC)
library(pROC)
library(scales)
library(kableExtra)
library(MASS)
library(stargazer)
# Color palettes used throughout the figures
palette5 <- c("#981FAC","#CB0F8B","#FF006A","#FE4C35","#FE9900")
palette4 <- c("#981FAC","#FF006A","#FE4C35","#FE9900")
palette2 <- c("#981FAC","#FF006A")
# Shared ggplot theme applied to most figures
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
# Read the data and drop rows with missing values
housing <- read.csv("housingSubsidy.csv") %>%
na.omit()
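As a quick sanity check (my own addition, not part of the original pipeline), the observation count and outcome balance can be confirmed directly:
# Sanity check: confirm dimensions and outcome balance after na.omit()
dim(housing) # should show 4,119 rows
table(housing$y) # counts of "no" vs. "yes" responses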
The first step was plotting continuous variables in relation to whether a client took a housing voucher. I was especially interested in features whose means differed sharply between the two levels of the dependent variable, Took Credit (answered as “yes” or “no”). inflation_rate, for example, shows that clients are much more likely to take the housing voucher and make home improvements during periods of low inflation. The previous variable shows that the more times our department has contacted someone, the more likely they are to take the voucher. Finally, people are more likely to take housing vouchers in periods of negative unemploy_rate, which serves as a proxy for a growing economy.
The second series of plots looks at the same variables as geom_density curves. I was specifically looking for plots where the no/yes curves flipped at different values, since that would imply a variable transformation could be beneficial. inflation_rate and unemploy_rate both qualified, since the no/yes outcomes flipped at higher values.
housing %>%
dplyr::select(y,age,previous,unemploy_rate,campaign,cons.price.idx,
cons.conf.idx,inflation_rate,spent_on_repairs) %>%
gather(Variable, value, -y) %>%
ggplot(aes(y, value, fill=y)) +
geom_bar(position = "dodge", stat = "summary", fun = "mean") +
facet_wrap(~Variable, scales = "free") +
scale_fill_manual(values = palette2) +
labs(x="Took Credit", y="Mean",
title = "Feature associations with the likelihood of taking home credit",
subtitle = "(Continous outcomes)") +
theme(legend.position = "none")
housing %>%
dplyr::select(y,age,previous,unemploy_rate,cons.price.idx,campaign,
cons.conf.idx,inflation_rate,spent_on_repairs) %>%
gather(Variable, value, -y) %>%
ggplot() +
geom_density(aes(value, color=y), fill = "transparent") +
facet_wrap(~Variable, scales = "free") +
scale_colour_manual(values = palette2) +
labs(title = "Feature distributions took credit v. no credit",
subtitle = "(continous outcomes)") +
plotTheme() +
guides(colour=guide_legend(title = "Took Credit"))
I also looked at the categorical variables. Proportionally, there is very little difference in who took the housing credit based on responses to the categorical questions.
For taxbill_in_phl there are more responses from residents of Philadelphia, yet the rate of housing credits appears even once group size is accounted for. Prospects contacted by cell phone are more likely to take the credit than households reached by landline telephone. marital is interesting because it shows that, proportionally, single households are more likely to take the housing credit, even though married households were the largest sampled group.
education and job both have many categories, so I wanted to transform them into factors that are easier to understand.
housing %>%
dplyr::select(y, marital,taxLien,mortgage,taxbill_in_phl,contact,poutcome) %>%
gather(Variable, value, -y) %>%
count(Variable, value, y) %>%
ggplot(., aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values = palette2) +
labs(x="Took Credit", y="Count",
title = "Feature associations with the likelihood of housing credit",
subtitle = "Categorical features") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
housing %>%
dplyr::select(y,job,education ) %>%
gather(Variable, value, -y) %>%
count(Variable, value, y) %>%
ggplot(., aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat="identity") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values = palette2) +
labs(x="Took Credit", y="Count",
title = "Feature associations with the likelihood of housing credit",
subtitle = "Categorical features") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
I transformed education and month into smaller categories. Education is now coded as “HighSchool or Less”, “College”, or “unknown”. Month is now split by season. Age showed significant variation across generations, so I split it into under 30, 30 to 50, and over 50 years old. Finally, I split inflation_rate and unemploy_rate into low and high binary categories.
housing <- housing %>%
mutate(education2 = recode(education, "illiterate" = "HighSchool or Less",
"basic.4y" = "HighSchool or Less",
"basic.6y" = "HighSchool or Less",
"basic.9y" = "HighSchool or Less",
"high.school" = "HighSchool or Less",
"professional.course" = "College",
"university.degree" = "College",
"unknown" = "unknown"),
month2 = recode(month,"dec" = "Winter",
"jan" = "Winter",
"feb" = "Winter",
"mar" = "Spring",
"apr" = "Spring",
"may" = "Spring",
"jun" = "Summer",
"jul" = "Summer",
"aug" = "Summer",
"sep" = "Fall",
"oct" = "Fall",
"nov" = "Fall"),
age2 = case_when(age <= 30 ~ "young",
age > 30 & age <= 50 ~ "middle-aged",
age > 50 ~ "old"),
inflation_rate2 = case_when(inflation_rate <= 3 ~ "low",
inflation_rate > 3 ~ "high"),
unemploy_rate2 = case_when(unemploy_rate <= -0.5 ~ "low",
unemploy_rate >-0.5 ~"high")) %>%
mutate_if(is.character, as.factor)
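A quick tabulation (illustrative, my own check) verifies that the recoded factors have the intended levels:
# Illustrative check of the engineered categories
table(housing$age2) # young / middle-aged / old
table(housing$education2) # HighSchool or Less / College / unknown
table(housing$month2) # Winter / Spring / Summer / Fall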
I split my dataset 65/35 into training and test sets. I then developed four models, whose output is shown below. Model 1 is my base model and includes all of the off-the-shelf features in this dataset. Model 2 is a subset of Model 1: I used the MASS package to run a backwards stepwise logistic regression, which removes variables sequentially based on AIC. This model showed a significantly lower AIC and also performs better in cross-validation later on.
Model 3 uses most of the base features, substituting the new variables I created for their base versions. Model 4 uses the same stepwise process to arrive at a subset of Model 3’s features. Each successive model had a worse McFadden’s R2 than the previous one. While this is useful information, I was mostly concerned with how these models would perform on out-of-fold data, to see whether they are generalizable. I suspected that the best-performing model by McFadden’s R2 here, Model 1, was overfit.
set.seed(3451)
trainIndex <- createDataPartition(housing$y, p = .65,
list = FALSE,
times = 1)
creditTrain <- housing[ trainIndex,]
creditTest <- housing[-trainIndex,]
model1 <- glm(y_numeric ~., data=creditTrain %>%
dplyr::select(-X,-y,-education2,-month2,-age2,-inflation_rate2, -unemploy_rate2),
family=binomial(link="logit"))
summary(model1)
stepAIC(model1,direction="backward")
model2 <- glm(y_numeric ~ taxbill_in_phl + contact + month + pdays + poutcome +
unemploy_rate + cons.price.idx + cons.conf.idx, data = creditTrain, family=binomial(link="logit"))
model3 <- glm(y_numeric ~ ., data=creditTrain %>%
dplyr::select(-X,-y, -education, -month, -age, inflation_rate, -unemploy_rate),
family=binomial(link="logit"))
summary(model3)
stepAIC(model3, direction="backward")
model4 <- glm(y_numeric ~ taxbill_in_phl + contact + campaign + pdays + poutcome +
cons.price.idx + cons.conf.idx + inflation_rate + month2, data=creditTrain,family=binomial(link="logit"))
pR2(model1)[4]
pR2(model2)[4]
pR2(model3)[4]
pR2(model4)[4]
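For easier comparison, the four McFadden values can be collected into one table (a convenience sketch of my own, not in the original output):
# Convenience sketch: McFadden's R2 for all four models side by side
data.frame(Model = paste("Model", 1:4),
McFadden = c(pR2(model1)[4], pR2(model2)[4],
pR2(model3)[4], pR2(model4)[4]))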
A few takeaways from the statistically significant predictors in the best model by McFadden’s R2, Model 1:
Blue-collar and management workers are much less likely than administrative employees to take the housing credit. Prospects contacted by landline telephone were less likely to use the credit than those contacted by cell phone. Credit uptake was highest in December and March, relative to the reference month of April. Lower unemployment and a higher consumer price index (CPI) were associated with a higher likelihood of taking the credit.
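Since logit coefficients are on the log-odds scale, exponentiating them yields odds ratios that are easier to interpret (a sketch of my own; for example, the contacttelephone coefficient of roughly -1.0 corresponds to an odds ratio of exp(-1.0) ≈ 0.37, about a 63% reduction in the odds of taking the credit):
# Sketch: convert model 1 log-odds coefficients to odds ratios
sort(exp(coef(model1)))[1:10] # the ten strongest negative associations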
#full models
stargazer(model1,model2,model3,model4, type = 'html', align=TRUE, single.row=TRUE, se=NULL, digits=1, dep.var.caption = "",title = "Regression Results, Training",
column.labels=c("Base Variables","Stepwise","Feature Engineering","Stepwise")) #this will work once knitted
Regression Results, Training. Dependent variable: y_numeric.

| | Base Variables (1) | Stepwise (2) | Feature Engineering (3) | Stepwise (4) |
|---|---|---|---|---|
| age | 0.01 (0.01) | | | |
| jobblue-collar | -0.6** (0.3) | | -0.5* (0.2) | |
| jobentrepreneur | -0.7 (0.5) | | -0.6 (0.5) | |
| jobhousemaid | -1.2* (0.7) | | -1.2* (0.7) | |
| jobmanagement | -0.7** (0.3) | | -0.6* (0.3) | |
| jobretired | -0.3 (0.4) | | 0.1 (0.3) | |
| jobself-employed | -0.7 (0.4) | | -0.5 (0.4) | |
| jobservices | -0.2 (0.3) | | -0.2 (0.3) | |
| jobstudent | -0.3 (0.4) | | -0.1 (0.4) | |
| jobtechnician | -0.1 (0.2) | | -0.01 (0.2) | |
| jobunemployed | 0.1 (0.4) | | 0.3 (0.4) | |
| jobunknown | -0.5 (0.8) | | -0.3 (0.7) | |
| maritalmarried | 0.4 (0.3) | | 0.3 (0.3) | |
| maritalsingle | 0.5 (0.3) | | 0.4 (0.3) | |
| maritalunknown | 1.1 (1.1) | | 1.0 (1.1) | |
| educationbasic.6y | 0.1 (0.4) | | | |
| educationbasic.9y | 0.2 (0.3) | | | |
| educationhigh.school | -0.1 (0.3) | | | |
| educationprofessional.course | -0.05 (0.4) | | | |
| educationuniversity.degree | -0.1 (0.3) | | | |
| educationunknown | 0.1 (0.4) | | | |
| taxLienunknown | 0.2 (0.2) | | 0.1 (0.2) | |
| taxLienyes | -9.2 (324.7) | | -9.4 (324.7) | |
| mortgageunknown | -0.01 (0.5) | | -0.004 (0.5) | |
| mortgageyes | -0.2 (0.1) | | -0.2 (0.1) | |
| taxbill_in_phlyes | 0.1 (0.2) | 0.1 (0.2) | 0.03 (0.2) | 0.1 (0.2) |
| contacttelephone | -1.0*** (0.3) | -0.8*** (0.3) | -0.6** (0.2) | -0.6*** (0.2) |
| monthaug | 0.3 (0.5) | 0.1 (0.4) | | |
| monthdec | 1.6** (0.7) | 1.1* (0.6) | | |
| monthjul | 0.2 (0.4) | 0.2 (0.4) | | |
| monthjun | -0.3 (0.5) | 0.1 (0.3) | | |
| monthmar | 2.2*** (0.6) | 1.8*** (0.5) | | |
| monthmay | 0.01 (0.3) | -0.2 (0.3) | | |
| monthnov | -0.3 (0.5) | -0.5 (0.4) | | |
| monthoct | 0.3 (0.6) | 0.1 (0.5) | | |
| monthsep | 0.5 (0.7) | 0.04 (0.5) | | |
| day_of_weekmon | -0.1 (0.2) | | -0.1 (0.2) | |
| day_of_weekthu | 0.01 (0.2) | | -0.01 (0.2) | |
| day_of_weektue | -0.2 (0.2) | | -0.2 (0.2) | |
| day_of_weekwed | 0.01 (0.2) | | -0.1 (0.2) | |
| campaign | -0.1 (0.04) | | -0.1 (0.04) | -0.1 (0.04) |
| pdays | -0.000 (0.001) | -0.000 (0.001) | -0.000 (0.001) | -0.001 (0.001) |
| previous | 0.2 (0.2) | | 0.2 (0.2) | |
| poutcomenonexistent | 0.7* (0.4) | 0.4* (0.2) | 0.6* (0.3) | 0.4* (0.2) |
| poutcomesuccess | 1.8** (0.7) | 1.6** (0.6) | 1.8** (0.7) | 1.4** (0.6) |
| unemploy_rate | -1.4*** (0.5) | -0.8*** (0.1) | | |
| cons.price.idx | 2.5*** (0.9) | 1.4*** (0.2) | 0.8* (0.5) | 0.8*** (0.2) |
| cons.conf.idx | 0.1** (0.03) | 0.05*** (0.02) | 0.1* (0.03) | 0.1*** (0.02) |
| inflation_rate | -0.2 (0.5) | | -0.2 (0.5) | -0.6*** (0.1) |
| spent_on_repairs | 0.01 (0.01) | | 0.001 (0.01) | |
| education2College | | | 0.02 (0.2) | |
| education2unknown | | | 0.2 (0.3) | |
| month2Summer | | | 0.4 (0.2) | 0.4* (0.2) |
| month2Winter | | | 1.7*** (0.6) | 1.4** (0.6) |
| month2Fall | | | 0.3 (0.3) | 0.1 (0.2) |
| age2young | | | -0.3 (0.2) | |
| inflation_rate2low | | | 1.8 (1.7) | |
| unemploy_rate2low | | | | |
| Constant | -309.3** (131.5) | -126.7*** (17.7) | -79.7 (78.9) | -73.0*** (14.1) |
| Observations | 2,679 | 2,679 | 2,679 | 2,679 |
| Log Likelihood | -708.4 | -723.0 | -725.2 | -737.9 |
| Akaike Inf. Crit. | 1,518.9 | 1,482.0 | 1,532.5 | 1,501.8 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | | |
Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
To evaluate model performance on out-of-fold samples, I ran these same models through k-fold cross-validation with 100 folds, evaluated by ROC. The partial models (2 and 4) showed much higher ROC, Sensitivity, and Specificity rates. Visual goodness-of-fit plots for the partial models are shown below.
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
cvFit1 <- train(y ~ ., data = housing %>%
dplyr::select(-X,-y_numeric,-education2,-month2,-age2,-inflation_rate2, -unemploy_rate2),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFit1
cvFit2 <- train(y ~ taxbill_in_phl + contact + month + pdays + poutcome +
unemploy_rate + cons.price.idx + cons.conf.idx, data = housing,
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFit2
cvFit3 <- train(y ~ ., data = housing %>%
dplyr::select(-X,-y_numeric, -education, -month, -age, inflation_rate, -unemploy_rate),
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFit3
cvFit4 <- train(y ~ taxbill_in_phl + contact + campaign + pdays + poutcome +
cons.price.idx + cons.conf.idx + inflation_rate + month2, data = housing,
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFit4
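Before plotting, the across-fold means for all four fits can be gathered into one comparison table (a convenience sketch of my own):
# Convenience sketch: mean CV metrics for each model
rbind(cvFit1$results, cvFit2$results, cvFit3$results, cvFit4$results) %>%
mutate(model = paste("Model", 1:4)) %>%
dplyr::select(model, ROC, Sens, Spec)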
While the means (purple vertical lines) are important here, I was looking for the folds to cluster around those means, since tight clustering implies generalizability. There is very little difference between model 2 and model 4. The ROC shows some variation across folds. Sensitivity, the rate of true positives, is very high and clustered around the mean, implying that these models are good at identifying clients who will take the credit. Specificity is much lower and varies more across folds, meaning the models are worse at capturing true negatives, i.e., cases where we correctly predict someone won’t take the housing credit.
dplyr::select(cvFit2$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit2$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#FF006A") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="Model 2 Stepwise: CV Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines") +
plotTheme()
dplyr::select(cvFit4$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit4$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#FF006A") +
facet_wrap(~metric) +
geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="Model 4 Stepwise: CV Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines") +
plotTheme()
I made model 4 my final model since it had the highest ROC, Sensitivity, and Specificity rates. This model is the stepwise version of my feature-engineered model. I created a new dataframe from the held-out test set that compares actual outcomes with predicted probabilities. I initially set the threshold for predicting “yes” (receiving a credit) at 0.5 or greater. The ROC curve is below.
testProbs <- data.frame(Outcome = as.factor(creditTest$y_numeric),
Probs = predict(model4, creditTest, type= "response"))
testProbs <-
testProbs %>%
mutate(predOutcome = as.factor(ifelse(Probs > 0.5, 1, 0)))
head(testProbs)
ggplot(testProbs, aes(d = as.numeric(as.character(Outcome)), m = Probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve - Stepwise Model 4")
Here is an overview of the cost-benefit matrix I decided on. Since this is a public-sector use case, the revenues and expenses are tricky. Home improvements increase property values by $10k, while allocating marketing resources costs the department $2.85k per household. The use case implied that even among homeowners correctly predicted to take the credit, only 25% would actually use it. A quick expected-value check follows the list below.
- True Positive: We correctly predicted the homeowner would take the credit and allocated marketing resources; 25% take the credit ($10k - $2.85k = $7,150 each) and 75% do not (-$2,850 each).
- True Negative: We correctly predicted the homeowner would not take the credit; no marketing resources or credit were allocated. $0.
- False Positive: We incorrectly predicted the homeowner would take the credit; marketing resources were allocated but no credit was taken. -$2,850.
- False Negative: We predicted the homeowner would not take the credit, but they took it anyway. These are likely homeowners who signed up for reasons unrelated to the marketing campaign. $0.
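A quick expected-value calculation (illustrative) shows why even a true positive loses money on average under these assumptions:
# Illustrative: expected revenue from a single true positive
0.25 * 7150 + 0.75 * -2850 # = -$350 per true positive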
cost_benefit_table <-
testProbs %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
True_Positive = sum(n[predOutcome==1 & Outcome==1]),
False_Negative = sum(n[predOutcome==0 & Outcome==1]),
False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Revenue =
case_when(Variable == "True_Negative" ~ Count * 0,
Variable == "True_Positive" ~ ((7150) * (Count * .25)) +
(-2850 * (Count * .75)),
Variable == "False_Negative" ~ (0) * Count,
Variable == "False_Positive" ~ (-2850) * Count)) %>%
bind_cols(data.frame(Description = c(
"We correctly predicted no credit and did not allocate marketing resources",
"We correctly predicted taking the credit and allocated marketing resources ",
"We incorrectly predicted no credit and didn't allocate marketing resources",
"We incorrectly predicted taking the credit and allocated marketing resources"))) %>%
kable(caption = "Cost Benefit Table") %>%
kable_styling("striped", full_width = F)
cost_benefit_table
| Variable | Count | Revenue | Description |
|---|---|---|---|
| True_Negative | 1263 | 0 | We correctly predicted no credit and did not allocate marketing resources |
| True_Positive | 40 | -14000 | We correctly predicted taking the credit and allocated marketing resources |
| False_Negative | 117 | 0 | We incorrectly predicted no credit and didn’t allocate marketing resources |
| False_Positive | 20 | -57000 | We incorrectly predicted taking the credit and allocated marketing resources |
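Summing the Revenue column (an illustrative check of my own) gives the total loss at the 0.5 threshold, which matches the 50% Threshold row in the final comparison table further below:
# Illustrative check: total revenue at the 0.5 threshold
sum(c(0, -14000, 0, -57000)) # = -$71,000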
Next, I built a while loop to find the optimal threshold for predicting that a housing credit will be taken. It should obviously be something greater than the arbitrary 0.5 I used earlier. Higher thresholds reduce marketing costs, but they may also increase our rate of false negatives: clients on the fence who would have taken the credit had we marketed to them.
iterateThresholds <- function(data, observedClass, predictedProbs, group) {
# This function takes as inputs: a data frame with an observed binomial class
# (1 or 0); a vector of predicted probabilities; and, optionally, a group
# indicator like race. It returns accuracy plus counts and rates of confusion
# matrix outcomes. It's a bit verbose because of the if (missing(group));
# I don't know another way to make an optional parameter.
observedClass <- enquo(observedClass)
predictedProbs <- enquo(predictedProbs)
group <- enquo(group)
x = .01
all_prediction <- data.frame()
if (missing(group)) {
while (x <= 1) {
this_prediction <- data.frame()
this_prediction <-
data %>%
mutate(predclass = ifelse(!!predictedProbs > x, 1,0)) %>%
count(predclass, !!observedClass) %>%
summarize(Count_TN = sum(n[predclass==0 & !!observedClass==0]),
Count_TP = sum(n[predclass==1 & !!observedClass==1]),
Count_FN = sum(n[predclass==0 & !!observedClass==1]),
Count_FP = sum(n[predclass==1 & !!observedClass==0]),
Rate_TP = Count_TP / (Count_TP + Count_FN),
Rate_FP = Count_FP / (Count_FP + Count_TN),
Rate_FN = Count_FN / (Count_FN + Count_TP),
Rate_TN = Count_TN / (Count_TN + Count_FP),
Accuracy = (Count_TP + Count_TN) /
(Count_TP + Count_TN + Count_FN + Count_FP)) %>%
mutate(Threshold = round(x,2))
all_prediction <- rbind(all_prediction,this_prediction)
x <- x + .01
}
return(all_prediction)
}
else if (!missing(group)) {
while (x <= 1) {
this_prediction <- data.frame()
this_prediction <-
data %>%
mutate(predclass = ifelse(!!predictedProbs > x, 1,0)) %>%
group_by(!!group) %>%
count(predclass, !!observedClass) %>%
summarize(Count_TN = sum(n[predclass==0 & !!observedClass==0]),
Count_TP = sum(n[predclass==1 & !!observedClass==1]),
Count_FN = sum(n[predclass==0 & !!observedClass==1]),
Count_FP = sum(n[predclass==1 & !!observedClass==0]),
Rate_TP = Count_TP / (Count_TP + Count_FN),
Rate_FP = Count_FP / (Count_FP + Count_TN),
Rate_FN = Count_FN / (Count_FN + Count_TP),
Rate_TN = Count_TN / (Count_TN + Count_FP),
Accuracy = (Count_TP + Count_TN) /
(Count_TP + Count_TN + Count_FN + Count_FP)) %>%
mutate(Threshold = round(x,2))
all_prediction <- rbind(all_prediction,this_prediction)
x <- x + .01
}
return(all_prediction)
}
}
## Applying the function to our dataset
whichThreshold <-
iterateThresholds(
data=testProbs, observedClass = Outcome, predictedProbs = Probs)
whichThreshold[1:5,]
whichThreshold <-
whichThreshold %>%
dplyr::select(starts_with("Count"), Threshold) %>%
gather(Variable, Count, -Threshold) %>%
mutate(Revenue =
case_when(Variable == "Count_TN" ~ Count * 0,
Variable == "Count_TP" ~ ((7150) * (Count * .25)) +
(-2850 * (Count * .75)),
Variable == "Count_FN" ~ (0) * Count,
Variable == "Count_FP" ~ (-2850) * Count))
The plot below visualizes revenue across increasing threshold levels. All confusion-matrix categories lose less revenue at higher thresholds. The most costly outcome is the false positive (in purple): we predict a client will take the credit and spend $2.85k in marketing resources for nothing. We want to be very sure a client will take the credit before marketing to them, which requires a threshold of at least 0.75. The last note on this plot is that in almost all scenarios the department is losing money; in fact, given this cost/benefit analysis it is impossible for HCD to “make money”, since even a true positive has a negative expected value. Is this a problem? Not necessarily, since restoring homes carries greater societal benefits for the city than pure profit and loss.
whichThreshold %>%
ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette5[c(5, 1:3)]) +
labs(title = "Revenue by confusion matrix type and threshold",
y = "Revenue") +
plotTheme() +
guides(colour=guide_legend(title = "Confusion Matrix"))
The chart below shows that revenue levels off at $0 once we only classify prospects as positive above a 0.90 threshold, at which point almost no marketing resources are spent.
whichThreshold_revenue <-
whichThreshold %>%
mutate(actualCredit = ifelse(Variable == "Count_TP", (Count * .25),
ifelse(Variable == "Count_FN", Count, 0))) %>%
group_by(Threshold) %>%
summarize(Revenue = sum(Revenue),
Actual_Credit_Rate = sum(actualCredit) / sum(Count),
Actual_Credit_Revenue_Loss = sum(actualCredit * 2850))
whichThreshold_revenue %>%
dplyr::select(Threshold, Revenue) %>%
gather(Variable, Value, -Threshold) %>%
ggplot(aes(Threshold, Value, colour = Variable)) +
geom_point() +
geom_vline(xintercept = pull(arrange(whichThreshold_revenue, -Revenue)[1,1])) +
scale_colour_manual(values = palette2) +
plotTheme() +
labs(title = "Revenue by threshold",
subtitle = "Vertical line denotes optimal threshold")
I built the same chart to find the threshold that maximizes the number of housing credits taken. This also levels off at the 0.90 threshold, with 157 credits taken. The count assumes, per the cost-benefit matrix, that 25% of true positives take the credit, and that all false negatives take the credit on their own.
whichThreshold_credit <-
whichThreshold %>%
mutate(actualCredit = ifelse(Variable == "Count_TP", (Count * .25),
ifelse(Variable == "Count_FN", Count, 0))) %>%
group_by(Threshold) %>%
summarize(Credit_Count = sum(actualCredit))
whichThreshold_credit %>%
dplyr::select(Threshold, Credit_Count) %>%
gather(Variable, Value, -Threshold) %>%
ggplot(aes(Threshold, Value, colour = Variable)) +
geom_point() +
geom_vline(xintercept = pull(arrange(whichThreshold_credit, -Credit_Count)[1,1])) +
scale_colour_manual(values = palette2) +
plotTheme() +
labs(title = "Actual Credits by threshold",
subtitle = "Vertical line denotes optimal threshold")
This is now shown in tabular format below.
#Optimal is the same - 0.90
#pull(arrange(whichThreshold_revenue, -Revenue)[1,1])
#pull(arrange(whichThreshold_credit, -Credit_Count)[1,1])
whichThreshold_revenue %>%
merge(.,whichThreshold_credit) %>%
filter(Threshold == 0.90 | Threshold == 0.50) %>%
dplyr::select(Threshold,Revenue,Credit_Count) %>%
as.data.frame() %>%
mutate(ID = c("50% Threshold","Optimal Threshold")) %>%
tibble::column_to_rownames('ID') %>%
kable(caption = "Threshold based on Optimal Revenue and Total Credits Taken") %>%
kable_styling("striped", full_width = F)
| | Threshold | Revenue | Credit_Count |
|---|---|---|---|
| 50% Threshold | 0.5 | -71000 | 127 |
| Optimal Threshold | 0.9 | 0 | 157 |
The best model used only the following features: taxbill_in_phl + contact + campaign + pdays + poutcome + cons.price.idx + cons.conf.idx + inflation_rate + month2. All of these variables were included in the base set except for month2, which I recoded into one of four seasons.
To improve this model I would need better feature engineering of the variables; that said, the ROC, Sensitivity, and Specificity were all reasonably good. To maximize revenue and credits taken, the model should only code a prospect as positive at a predicted probability of 0.90 or higher. This model clearly improves on the previous marketing outreach efforts, so it should be put into production. Variables like pdays and poutcome show the importance of reaching out frequently and the efficacy of prior marketing campaigns. I would advocate for more marketing efforts to educate homeowners about the credit, whether in person or over the phone.
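As a closing illustration (a hypothetical sketch, not part of the assignment), scoring a fresh batch of prospects at the 0.90 threshold might look like the following, where newProspects is an assumed data frame with the same columns as the training data:
# Hypothetical production sketch: score new prospects and flag outreach targets
newProspects <- creditTest # stand-in for incoming client records
outreachList <- newProspects %>%
mutate(prob = predict(model4, newProspects, type = "response"),
contactClient = ifelse(prob > 0.90, "yes", "no")) %>%
filter(contactClient == "yes")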