Emil City is considering a more proactive approach for targeting homeowners who qualify for a home repair tax credit program. Based on a dataset from previous years, we implement a logistic regression to predict program enrollment and assess whether a predictive model can improve the net revenue produced by the program. We compare the base dataset to a dataset that includes several engineered features. We find that, in both cases, the projected net revenue is a significant improvement over the current process of random outreach. We further find that feature engineering improves the performance of the model and can inform future outreach. Although we note several ways in which the model could be improved, we recommend that it be put into production, given that even a moderately successful predictive model offers a substantial improvement over random outreach.
Show the code
library(tidyverse)
library(caret)
library(ggthemr)
library(pscl)
library(plotROC)
library(pROC)
library(scales)
library(rstatix)
library(ggpubr)
library(kableExtra)
library(crosstable)
library(ggcorrplot)
library(rsample)
library(gridExtra)

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
source("hw4_fns.R")

options(scipen = 999)
ggthemr('pale')

housing_subsidy_path <- "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter6/housingSubsidy.csv"
housing_subsidy_data <- read.csv(housing_subsidy_path)

# log transform previous, pdays, and campaign
# collapse sparse classes to avoid resampling issues in the train/test split
# (only one observation each of education == "illiterate" and taxLien == "yes")
housing_subsidy_data <- housing_subsidy_data %>%
  mutate(
    education = case_when(
      education == "basic.4y"   ~ "basic.4y or less",
      education == "illiterate" ~ "basic.4y or less",
      TRUE ~ education
    ),
    taxLien = case_when(
      taxLien == "yes"     ~ "unknown/yes",
      taxLien == "unknown" ~ "unknown/yes",
      TRUE ~ taxLien
    )
  )
2 Motivation
Emil City is considering a more proactive approach for targeting homeowners who qualify for a home repair tax credit program. This tax credit program has been around for close to twenty years, and while the Department of Housing and Community Development (HCD) tries to proactively reach out to eligible homeowners every year, the uptake of the credit is woefully inadequate. Typically, only a small share of the eligible homeowners HCD reaches out to ultimately enter the program and take the credit.
The consensus at HCD is that the low conversion rate is due to the fact that the agency reaches out to eligible homeowners at random. To move toward a more targeted campaign, we use a binomial logistic regression to assess whether client-level data collected from previous campaigns can drive decision-making that better targets limited outreach resources and improves the success rate of getting homeowners into the home repair tax credit program.
3 Methods
3.1 Data
For this analysis, we use historic data collected by Emil City. To start, we assess the relationship of the various potential predictors to our dependent variable, y, and the relationships of the predictors to each other. We do this using a correlation plot, t-tests (for continuous variables), and chi-squared tests (for categorical variables). We are looking for variables that 1) have a statistically significant relationship with our dependent variable, y, and 2) do not exhibit strong multicollinearity with each other. We first run a logistic regression on our full dataset and then, using these analytic tools, attempt to eliminate unhelpful features and/or engineer more helpful ones to create the most parsimonious model that maximizes the cost-benefit tradeoffs of the outreach approach. Some minor data cleaning is required: we find that the single occurrence of "illiterate" in the education column of our base dataset makes it impossible to resample properly for our model, so we group it with the "basic.4y" category as "basic.4y or less". Likewise, we group the single occurrence of "yes" in the taxLien column with the "unknown" category as "unknown/yes".
3.1.1 Multicollinearity
Using a correlation plot, we can assess which predictors are multicollinear and should therefore be removed from our model. For our purposes, we define strong multicollinearity as an r-value greater than 0.9 or less than -0.9, which signals a redundancy in including both predictors (in other words, adding the second one does not improve the predictive power of the model). In the case of multicollinearity, we usually keep one predictor and remove the others that are strongly correlated with it. Here, we drop only two predictors: inflation rate and unemployment rate.
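A minimal sketch of how such a correlation plot can be generated with the ggcorrplot package loaded above (the plotting options shown here are illustrative, not necessarily those used for the figure in our report):

# Correlation matrix of the numeric predictors, dropping the row index (X)
# and the numeric copy of the outcome (y_numeric)
numeric_vars <- housing_subsidy_data %>%
  select(where(is.numeric)) %>%
  select(-X, -y_numeric)

ggcorrplot(cor(numeric_vars, use = "pairwise.complete.obs"),
           type = "lower", lab = TRUE,
           title = "Correlations among numeric predictors")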
3.1.2 T-Tests for Continuous Predictors
Using t-tests, we can assess whether continuous predictors differ across the two classes of the dependent variable in a statistically significant way. If they do, they are useful predictors for our model; if they do not, they do not contribute meaningfully to it. Based on these t-tests, we find that one of our continuous variables, X, shows no statistically significant difference across classes and can therefore be discarded.
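As a sketch of these comparisons, we can loop over the numeric predictors and run a Welch t-test of each one against the two classes of y (a simplified stand-in for the tests reported in our analysis):

# Welch t-test of each numeric predictor across the two classes of y
numeric_predictors <- housing_subsidy_data %>%
  select(where(is.numeric)) %>%
  select(-X, -y_numeric) %>%
  names()

t_test_results <- map_dfr(numeric_predictors, function(v) {
  test <- t.test(housing_subsidy_data[[v]] ~ housing_subsidy_data$y)
  tibble(variable    = v,
         t_statistic = unname(test$statistic),
         p_value     = test$p.value)
})

# Variables with the largest p-values are candidates for removal
t_test_results %>% arrange(desc(p_value))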
3.1.3 Chi-Squared Tests for Categorical Predictors
Similarly, we can use chi-squared tests to assess whether categorical predictors show statistically significant differences across the classes of the dependent variable. Based on these tests, we drop four predictors: "mortgage", "taxbill_in_phl", "day_of_week", and "education".
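The table below reports these comparisons. A minimal sketch of how a comparable table might be produced with the crosstable package loaded above (the column selection and formatting here are assumptions; the rendered table also shows row and column percentages):

# Cross-tabulate each categorical predictor against y; test = TRUE requests
# an appropriate test (chi-squared or Fisher's exact) for each variable
crosstable(housing_subsidy_data,
           cols = c(job, marital, education, taxLien, mortgage,
                    taxbill_in_phl, contact, month, day_of_week, poutcome),
           by = y, test = TRUE) %>%
  as_flextable()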
Categorical predictors by credit uptake (y). Cells show count (row % / column %).

| Variable | Level | y = no | y = yes | Total | Test |
|---|---|---|---|---|---|
| job | admin. | 879 (87%/24%) | 133 (13%/29%) | 1012 (25%) | p < 0.0001 (Fisher's Exact Test, simulated p-value based on 100,000 replicates) |
| | blue-collar | 823 (93%/22%) | 61 (7%/14%) | 884 (21%) | |
| | entrepreneur | 140 (95%/4%) | 8 (5%/2%) | 148 (4%) | |
| | housemaid | 99 (90%/3%) | 11 (10%/2%) | 110 (3%) | |
| | management | 294 (91%/8%) | 30 (9%/7%) | 324 (8%) | |
| | retired | 128 (77%/3%) | 38 (23%/8%) | 166 (4%) | |
| | self-employed | 146 (92%/4%) | 13 (8%/3%) | 159 (4%) | |
| | services | 358 (91%/10%) | 35 (9%/8%) | 393 (10%) | |
| | student | 63 (77%/2%) | 19 (23%/4%) | 82 (2%) | |
| | technician | 611 (88%/17%) | 80 (12%/18%) | 691 (17%) | |
| | unemployed | 92 (83%/3%) | 19 (17%/4%) | 111 (3%) | |
| | unknown | 35 (90%/1%) | 4 (10%/1%) | 39 (1%) | |
| marital | divorced | 403 (90%/11%) | 43 (10%/10%) | 446 (11%) | p = 0.0165 (Fisher's Exact Test) |
| | married | 2257 (90%/62%) | 252 (10%/56%) | 2509 (61%) | |
| | single | 998 (87%/27%) | 155 (13%/34%) | 1153 (28%) | |
| | unknown | 10 (91%/0.3%) | 1 (9%/0.2%) | 11 (0.3%) | |
| education | basic.4y or less | 392 (91%/11%) | 38 (9%/8%) | 430 (10%) | p = 0.0011 (Pearson's Chi-squared test) |
| | basic.6y | 211 (93%/6%) | 17 (7%/4%) | 228 (6%) | |
| | basic.9y | 531 (93%/14%) | 43 (7%/10%) | 574 (14%) | |
| | high.school | 824 (89%/22%) | 97 (11%/22%) | 921 (22%) | |
| | professional.course | 470 (88%/13%) | 65 (12%/14%) | 535 (13%) | |
| | university.degree | 1099 (87%/30%) | 165 (13%/37%) | 1264 (31%) | |
| | unknown | 141 (84%/4%) | 26 (16%/6%) | 167 (4%) | |
| taxLien | no | 2913 (88%/79%) | 402 (12%/89%) | 3315 (80%) | p < 0.0001 (Pearson's Chi-squared test) |
| | unknown/yes | 755 (94%/21%) | 49 (6%/11%) | 804 (20%) | |
| mortgage | no | 1637 (89%/45%) | 202 (11%/45%) | 1839 (45%) | p = 0.7307 (Pearson's Chi-squared test) |
| | unknown | 96 (91%/3%) | 9 (9%/2%) | 105 (3%) | |
| | yes | 1935 (89%/53%) | 240 (11%/53%) | 2175 (53%) | |
| taxbill_in_phl | no | 693 (90%/19%) | 77 (10%/17%) | 770 (19%) | p = 0.3495 (Pearson's Chi-squared test) |
| | yes | 2975 (89%/81%) | 374 (11%/83%) | 3349 (81%) | |
| contact | cellular | 2277 (86%/62%) | 375 (14%/83%) | 2652 (64%) | p < 0.0001 (Pearson's Chi-squared test) |
| | telephone | 1391 (95%/38%) | 76 (5%/17%) | 1467 (36%) | |
| month | apr | 179 (83%/5%) | 36 (17%/8%) | 215 (5%) | p < 0.0001 (Fisher's Exact Test, simulated p-value based on 100,000 replicates) |
| | aug | 572 (90%/16%) | 64 (10%/14%) | 636 (15%) | |
| | dec | 10 (45%/0.3%) | 12 (55%/3%) | 22 (1%) | |
| | jul | 652 (92%/18%) | 59 (8%/13%) | 711 (17%) | |
| | jun | 462 (87%/13%) | 68 (13%/15%) | 530 (13%) | |
| | mar | 20 (42%/1%) | 28 (58%/6%) | 48 (1%) | |
| | may | 1288 (93%/35%) | 90 (7%/20%) | 1378 (33%) | |
| | nov | 403 (90%/11%) | 43 (10%/10%) | 446 (11%) | |
| | oct | 44 (64%/1%) | 25 (36%/6%) | 69 (2%) | |
| | sep | 38 (59%/1%) | 26 (41%/6%) | 64 (2%) | |
| day_of_week | fri | 685 (89%/19%) | 83 (11%/18%) | 768 (19%) | p = 0.9723 (Pearson's Chi-squared test) |
| | mon | 757 (89%/21%) | 98 (11%/22%) | 855 (21%) | |
| | thu | 764 (89%/21%) | 96 (11%/21%) | 860 (21%) | |
| | tue | 750 (89%/20%) | 91 (11%/20%) | 841 (20%) | |
| | wed | 712 (90%/19%) | 83 (10%/18%) | 795 (19%) | |
| poutcome | failure | 387 (85%/11%) | 67 (15%/15%) | 454 (11%) | p < 0.0001 (Pearson's Chi-squared test) |
| | nonexistent | 3231 (92%/88%) | 292 (8%/65%) | 3523 (86%) | |
| | success | 50 (35%/1%) | 92 (65%/20%) | 142 (3%) | |
| Total | | 3668 (89%) | 451 (11%) | 4119 (100%) | |
3.1.4 Feature Engineering
We further attempt to exploit differences in the distributions of the continuous and categorical variables in our dataset by engineering new features that highlight the nuanced ways in which they diverge.
Show the code
hmm1 <- housing_subsidy_data %>%
  pivot_longer(cols = where(is.numeric), names_to = "variable", values_to = "value") %>%
  filter(variable != "y_numeric")

ggplot(hmm1) +
  geom_density(aes(x = value, fill = y)) +
  facet_wrap(~variable, scales = "free") +
  labs(x = "Output Var", y = "Density",
       title = "Feature associations with the likelihood of entering a program",
       subtitle = "(Continuous outcomes)") +
  theme(axis.text.x = element_text(hjust = 1, angle = 45))
Show the code
hmm2 <- housing_subsidy_data %>%
  pivot_longer(cols = where(is.character), names_to = "variable", values_to = "value") %>%
  filter(variable != "y") %>%
  select(y_numeric, value, variable) %>%
  group_by(y_numeric, variable, value) %>%
  summarize(count = n())

hmm_counts <- hmm2 %>%
  group_by(y_numeric, variable) %>%
  summarize(total = sum(count))

hmm2 <- left_join(hmm2, hmm_counts) %>%
  mutate(pct = count / total * 100)

ggplot(hmm2) +
  geom_col(aes(x = value, y = pct, fill = as.factor(y_numeric)), position = "dodge") +
  facet_wrap(~variable, scales = "free") +
  labs(x = "Output Var", y = "Percent of Class",
       title = "Feature associations with the likelihood of entering a program",
       subtitle = "(Categorical outcomes)") +
  theme(axis.text.x = element_text(hjust = 1, angle = 45))
Based on these distributions, we implement some of the following approaches (this is not a comprehensive list; full engineering can be seen in the code below):
Use binary markers to indicate whether the consumer confidence index (CCI) for a given observation falls into one of the observed peaks in CCI associated with higher rates of credit enrollment
Create a binary variable to indicate the presence of a university degree
Create a binary category for months to indicate whether a given observation occurred in a month associated with a higher rate of credit enrollment
We also drop redundant variables following these transformations.
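A minimal sketch of how these engineered features might be constructed (the cut points, month groupings, and dropped columns here are illustrative assumptions based on the distributions above, not the exact transformations used in the final model):

# Illustrative engineered features (cut points and groupings are assumptions)
housing_subsidy_eng <- housing_subsidy_data %>%
  mutate(
    # binary marker for an observed peak in the consumer confidence index
    cci_peak = ifelse(cons.conf.idx > -42 & cons.conf.idx < -36, 1, 0),
    # binary indicator for holding a university degree
    univ_degree = ifelse(education == "university.degree", 1, 0),
    # binary indicator for months with higher observed enrollment rates
    high_enroll_month = ifelse(month %in% c("mar", "sep", "oct", "dec"), 1, 0)
  ) %>%
  # drop originals made redundant by the new features
  select(-education, -month)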
4 Results
4.1 Logistic Regression
Below, we use a binomial logistic regression to predict the classifications of our dependent variable, y. A logistic regression estimates, from the predictor variables, the probability that the dependent variable falls within one of two classes. By setting a probability threshold (typically 0.5), we can convert those probabilities into classifications. Our primary outcomes of interest are sensitivity, specificity, and the misclassification rate. Sensitivity (the true positive rate) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people correctly identified as having the condition) and is complementary to the false negative rate. Specificity (the true negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people correctly identified as not having the condition) and is complementary to the false positive rate. The misclassification rate is the number of false positives plus false negatives divided by the total number of observations.
Although the standard approach is to pursue balanced classification accuracy, that is, a high true positive rate and a low false positive rate, in this particular instance we seek to maximize the true positive rate above all else, given that the return on a true positive outweighs the loss on a false positive by a factor of more than four ($12,400 versus $2,850, as detailed in the cost-benefit analysis below).
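A minimal sketch of the cross-validated fit summarized below (the seed, split object names, and predictor selection are assumptions; twoClassSummary reports the cross-validated ROC, sensitivity, and specificity):

set.seed(1234)

# 65/35 train/test split, stratified on the outcome
data_split <- initial_split(housing_subsidy_data, prop = 0.65, strata = "y")
train_set  <- training(data_split)
test_set   <- testing(data_split)

# 100-fold cross-validated logistic regression
ctrl <- trainControl(method = "cv", number = 100,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

reg_cv <- train(y ~ .,
                data = train_set %>%
                  mutate(y = factor(y, levels = c("no", "yes"))) %>%
                  select(-y_numeric, -X),
                method = "glm", family = "binomial",
                metric = "ROC", trControl = ctrl)
reg_cv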
Generalized Linear Model
2677 samples
23 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (100 fold)
Summary of sample sizes: 2650, 2650, 2650, 2650, 2651, 2650, ...
Resampling results:
ROC Sens Spec
0.7506401 0.9794203 0.1566667
4.2 Prediction Accuracy
We find that our feature engineering does marginally increase the balanced accuracy of the model, as indicated by the increase in AUC. It is, comparatively speaking, a better model than the kitchen sink approach, although in absolute terms, it is only a moderately successful model. However, as discussed below, its increased sensitivity has meaningful implications for its functional utility in revenue generation and service delivery in the public sector.
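For reference, the AUC comparison can be computed directly from the held-out probabilities with pROC (a sketch; testProbs_base and testProbs_eng are assumed to be data frames holding the observed outcome and predicted probability for each test-set homeowner, as used in the cost-benefit tables below):

# Area under the ROC curve for each model on the test set
auc(testProbs_base$Outcome, testProbs_base$Probs)
auc(testProbs_eng$Outcome, testProbs_eng$Probs)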
In keeping with the goals of this project, we are not seeking balanced accuracy, but rather to maximize revenue according to our cost-benefit analysis. Per our assumptions, the credit allocates $5,000 per homeowner, which can be used toward home improvement, and each marketing contact costs $2,850. Academic researchers in Philadelphia evaluated the program and found that houses that transacted after taking the credit sold with a $10,000 premium, on average, while homes surrounding the repaired home saw an aggregate premium of $56,000, on average. Based on these assumptions, we assign the following value to each prediction outcome:
True Positive - Predicted correctly homeowner would enter credit program; allocated the marketing resources, and 25% ultimately achieved the credit. Per the formula below, we count a net revenue gain of $12,400 per true positive.
True Negative - Predicted correctly homeowner would not enter the credit program, no marketing resources were allocated, and no credit was allocated. Thus, the net revenue gain for a true negative is $0.
False Positive - Predicted incorrectly homeowner would enter the credit program; allocated marketing resources; no credit allocated. We count a net revenue loss of $2,850 per false positive.
False Negative - We predicted that a homeowner would not enter the credit program but they did. These are likely homeowners who signed up for reasons unrelated to the marketing campaign. Thus, the net revenue gain for a false negative is $0.
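One way to reconstruct the $12,400 figure from the assumptions above (a sketch of the arithmetic, with the 25% conversion rate applied to both the sale premiums and the $5,000 credit):

0.25 x ($10,000 + $56,000) - $2,850 - 0.25 x $5,000 = $16,500 - $2,850 - $1,250 = $12,400 per true positive

A false positive, by contrast, incurs only the $2,850 marketing cost, which is why it enters the analysis as a loss of $2,850.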
Because the benefit of a true positive significantly outweighs the cost of a false positive ($12,400 gained per true positive versus $2,850 lost per false positive), we tune our model to maximize sensitivity (the true positive rate) rather than balanced accuracy.
Show the code
print_cb_tab(testProbs_base, "Cost-Benefit Analysis with Base Data")
Cost-Benefit Analysis with Base Data

| Variable | Count | Revenue | Description |
|---|---|---|---|
| True_Negative | 1,247 | $0 | Predicted correctly homeowner would not enter the credit program; no marketing resources were allocated, and no credit was allocated. |
| True_Positive | 35 | $434,000 | Predicted correctly homeowner would enter credit program; allocated the marketing resources, and 25% ultimately achieved the credit. |
| False_Negative | 123 | $0 | We predicted that a homeowner would not enter the credit program, but they did. |
| False_Positive | 37 | -$105,450 | Predicted incorrectly that a homeowner would enter the credit program. |
Show the code
print_cb_tab(testProbs_eng, "Cost-Benefit Analysis with Engineered Data")
Cost-Benefit Analysis with Engineered Data

| Variable | Count | Revenue | Description |
|---|---|---|---|
| True_Negative | 1,270 | $0 | Predicted correctly homeowner would not enter the credit program; no marketing resources were allocated, and no credit was allocated. |
| True_Positive | 38 | $471,200 | Predicted correctly homeowner would enter credit program; allocated the marketing resources, and 25% ultimately achieved the credit. |
| False_Negative | 120 | $0 | We predicted that a homeowner would not enter the credit program, but they did. |
| False_Positive | 14 | -$39,900 | Predicted incorrectly that a homeowner would enter the credit program. |
Based on our test datasets, we find that our engineered dataset does return a slightly higher maximum revenue than our base dataset, at a threshold around 0.15. It is notable that the threshold is so much lower than the standard 0.5. Again, this is because we are not interested in balanced class accuracy in this use case, but rather in maximizing net revenue, which requires prioritizing sensitivity (true positives) above all else.
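A minimal sketch of the threshold sweep behind the table below (the column names in testProbs_eng and the "yes"/"no" coding of the outcome are assumptions; the per-outcome dollar values follow the cost-benefit assumptions above):

# Net revenue and credits allocated at each candidate threshold
thresholds <- seq(0.05, 0.95, by = 0.05)

revenue_by_threshold <- map_dfr(thresholds, function(t) {
  classified <- testProbs_eng %>%
    mutate(pred = ifelse(Probs >= t, "yes", "no"))
  tp <- sum(classified$Outcome == "yes" & classified$pred == "yes")
  fp <- sum(classified$Outcome == "no"  & classified$pred == "yes")
  tibble(Threshold     = t,
         total_credits = tp * 0.25 * 5000,        # 25% of true positives take the $5,000 credit
         revenue       = tp * 12400 - fp * 2850)  # per-outcome values from the analysis above
})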
Credit Counts and Revenues Across Models and Thresholds

| Threshold | Model | Total Credits Allocated | Net Revenue |
|---|---|---|---|
| 0.20 | Base Data | $98,750 | $666,100 |
| 0.50 | Base Data | $43,750 | $328,550 |
| 0.15 | Engineered Data | $110,000 | $803,350 |
| 0.50 | Engineered Data | $47,500 | $431,300 |
5 Discussion
5.1 Importance of Feature Engineering
This exercise speaks to the importance of proper feature engineering in building a model. Several features, such as the consumer price index, were rendered meaningful by examining the full distributions of the data (rather than a summary statistic like the mean) and deriving new features from them, such as binary markers of peaks in the index. Further features, such as interaction terms, could be added in more sophisticated models. The result of this feature engineering was a small but financially meaningful increase in projected net revenue relative to our base dataset, indicating that a few hours' worth of data wrangling could translate to tens of thousands of dollars in additional revenue.
5.2 Possible New Features
To further improve the model, any number of new features could be added, evaluated, and engineered. Among those that we suspect may improve the model are homeowner race, License & Inspections violations, homeowner tenure, and enrollment in other City programs. As with other features, these should be explored in order to assess correlation and the possibility of deriving additional features from them.
5.2.1 Feature Exploration to Guide Outreach
Additionally, we note that a basic understanding of correlations can be used to further guide outreach. Noting that, for example, certain peaks in unemployment rates were correlated with enrollment in the credit, it might be wise to pursue more outreach during comparable future peaks.
5.3 Outcome Prioritization
Another notable component of this exercise is the divergence of academic and public-sector priorities in model building. The default aim in building a logistic regression model is usually to maximize its balanced class accuracy. In this case, however, given our cost-benefit analysis, we found that maximizing sensitivity was most important, since the payoff for a true positive is much higher than the cost of a false positive, and true negatives and false negatives both have net returns of $0. This speaks further to the importance of domain knowledge in data analysis and model building.
5.4 Limitations of Logistic Regression
Finally, this model runs up against the limits of logistic regression. Although logit models have fewer assumptions to satisfy than standard OLS regression, for example, they have weaknesses of their own. For one thing, they typically require very large datasets: a dataset of 40,000 or even 400,000 observations would likely have yielded more accurate results than our dataset of roughly 4,000 observations. This is especially true in the presence of a severe class imbalance, as we have here. Of the 4,119 observations in our dataset, only 10.9% are instances in which the homeowner took the credit. This means we are likely dealing with a rare event, in which case we might find more success with a penalized likelihood approach such as lasso, ridge, or elastic net regression.
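As an illustration of what such a penalized approach might look like (a sketch using the glmnet package, which is not loaded above, and reusing the train_set object assumed earlier; whether it actually improves on the logistic model here would need to be evaluated):

library(glmnet)

# Elastic-net logistic regression; alpha = 0.5 mixes the lasso (alpha = 1)
# and ridge (alpha = 0) penalties, with lambda chosen by cross-validation
x_train <- model.matrix(y ~ . - 1, data = train_set %>% select(-y_numeric, -X))
y_train <- factor(train_set$y, levels = c("no", "yes"))

cv_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0.5)

# Predicted enrollment probabilities at the cross-validated lambda
head(predict(cv_fit, newx = x_train, s = "lambda.min", type = "response"))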
6 Conclusion
We find that 1) the use of logistic regression as a predictive tool has the potential to increase revenues over the current standard operating practice of random outreach, and 2) that basic feature engineering of the dataset can be used to further increase net revenue using the predictive model. It is true that the bar is low: improving on random outreach, which is no better than a coin flip, is not hard to do. But given how commonly such practices are standard in local governments, this exercise speaks to the large return on investment that can be derived from even basic predictive modeling. The exercise furthermore indicates particular features–such as timing and economic factors–that should be used to guide outreach efforts in order to increase the response rates. Thus, the model (or really, any model with decent sensitivity) is worth putting into production, and the process of building it and improving it can also suggest ways to improve current outreach practices.