Emil City is considering a more proactive approach for targeting homeowners who qualify for a home repair tax credit program. Based on a dataset from previous years, we implement a logistic regression to predict program enrollment and assess whether a predictive model can improve the net revenue produced by the program. We compare the base dataset to a dataset that includes several engineered features. We find that, in both cases, the projected net revenue is a significant improvement over the current process of random outreach. We further find that feature engineering improves the performance of the model and can inform future outreach. Although we note several ways in which the model could be improved, we recommend that it be put into production, given that even a moderately successful predictive model offers a substantial improvement over random outreach.
Show the code
library(tidyverse)
library(caret)
library(ggthemr)
library(pscl)
library(plotROC)
library(pROC)
library(scales)
library(rstatix)
library(ggpubr)
library(kableExtra)
library(crosstable)
library(ggcorrplot)
library(rsample)
library(gridExtra)

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
source("hw4_fns.R")

options(scipen = 999)
ggthemr('pale')

housing_subsidy_path <- "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter6/housingSubsidy.csv"
housing_subsidy_data <- read.csv(housing_subsidy_path)

# log transform previous, pdays, and campaign
# collapse sparse classes to avoid resampling issues in the train/test split
# (only one observation each of education == "illiterate" and taxLien == "yes")
housing_subsidy_data <- housing_subsidy_data %>%
  mutate(
    education = case_when(
      education == "basic.4y"   ~ "basic.4y or less",
      education == "illiterate" ~ "basic.4y or less",
      TRUE ~ education
    ),
    taxLien = case_when(
      taxLien == "yes"     ~ "unknown/yes",
      taxLien == "unknown" ~ "unknown/yes",
      TRUE ~ taxLien
    )
  )
2 Motivation
Emil City is considering a more proactive approach for targeting homeowners who qualify for a home repair tax credit program. This tax credit program has been around for close to twenty years, and while the Department of Housing and Community Development (HCD) tries to proactively reach out to eligible homeowners every year, the uptake of the credit is woefully inadequate. Typically, only a small share of the eligible homeowners HCD reaches out to ultimately enter the program and take the credit.
The consensus at HCD is that the low conversion rate is due to the fact that the agency reaches out to eligible homeowners at random. To move toward a more targeted campaign, we use a binomial logistic regression to assess whether client-level data collected from previous campaigns can drive decision-making that better targets limited outreach resources and improves the success rate of getting homeowners into the home repair tax credit program.
3 Methods
3.1 Data
For this analysis, we use historic data collected by Emil City. To start, we assess the relationship of the various potential predictors to our dependent variable, y, and the relationships of the predictors to each other. We do this using a correlation plot, t-tests (for continuous variables), and chi-squared tests (for categorical variables). We are looking for variables that 1) have a statistically significant relationship with our dependent variable, y, and 2) do not exhibit strong multicollinearity with each other. We first run a logistic regression on our full dataset and then, using these analytic tools, attempt to eliminate unhelpful features and/or engineer more helpful ones to create the most parsimonious model that maximizes the cost-benefit tradeoffs of the outreach approach. Some minor data cleaning is required: we find that the single occurrence of "illiterate" in the education column of our base dataset makes it impossible to resample properly for our model, so we group it with the "basic.4y" category as "basic.4y or less". Likewise, we group the single occurrence of "yes" in the taxLien column with the "unknown" category as "unknown/yes".
3.1.1 Multicollinearity
Using a correlation plot, we can assess which predictors are multicollinear and should therefore be removed from our model. For our purposes, we define strong multicollinearity as an r-value greater than 0.9 or less than -0.9, which signals a redundancy in including both predictors (in other words, adding the second one does not improve the predictive power of the model). In the case of multicollinearity, we usually keep one predictor and remove the others that are strongly correlated with it. Here, we drop only two predictors: inflation rate and unemployment rate.
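A minimal sketch of how such a correlation plot can be generated with the ggcorrplot package loaded above (the plotting options shown here are illustrative, not necessarily those used for the figure in our report):

# Correlation matrix of the numeric predictors, dropping the row index (X)
# and the numeric copy of the outcome (y_numeric)
numeric_vars <- housing_subsidy_data %>%
  select(where(is.numeric)) %>%
  select(-X, -y_numeric)

ggcorrplot(cor(numeric_vars, use = "pairwise.complete.obs"),
           type = "lower", lab = TRUE,
           title = "Correlations among numeric predictors")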
3.1.2 T-Tests for Continuous Predictors
Using t-tests, we can assess whether continuous predictors differ across the two classes of the dependent variable in a statistically significant way. If they do, they are useful predictors for our model; if they do not, they do not contribute meaningfully to it. Based on these t-tests, we find that one of our continuous variables, X, shows no statistically significant difference across classes and can therefore be discarded.
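As a sketch of these comparisons, we can loop over the numeric predictors and run a Welch t-test of each one against the two classes of y (a simplified stand-in for the tests reported in our analysis):

# Welch t-test of each numeric predictor across the two classes of y
numeric_predictors <- housing_subsidy_data %>%
  select(where(is.numeric)) %>%
  select(-X, -y_numeric) %>%
  names()

t_test_results <- map_dfr(numeric_predictors, function(v) {
  test <- t.test(housing_subsidy_data[[v]] ~ housing_subsidy_data$y)
  tibble(variable    = v,
         t_statistic = unname(test$statistic),
         p_value     = test$p.value)
})

# Variables with the largest p-values are candidates for removal
t_test_results %>% arrange(desc(p_value))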
3.1.3 Chi-Squared Tests for Categorical Predictors
Similarly, we can use chi-squared tests to assess whether categorical predictors show statistically significant differences across the classes of the dependent variable. Based on these tests, we drop four predictors: "mortgage", "taxbill_in_phl", "day_of_week", and "education".
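The table below reports these comparisons. A minimal sketch of how a comparable table might be produced with the crosstable package loaded above (the column selection and formatting here are assumptions; the rendered table also shows row and column percentages):

# Cross-tabulate each categorical predictor against y; test = TRUE requests
# an appropriate test (chi-squared or Fisher's exact) for each variable
crosstable(housing_subsidy_data,
           cols = c(job, marital, education, taxLien, mortgage,
                    taxbill_in_phl, contact, month, day_of_week, poutcome),
           by = y, test = TRUE) %>%
  as_flextable()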
Categorical predictors by credit uptake (y). Cells show count (row % / column %).

| Variable | Level | y = no | y = yes | Total | Test |
|---|---|---|---|---|---|
| job | admin. | 879 (87%/24%) | 133 (13%/29%) | 1012 (25%) | p < 0.0001 (Fisher's Exact Test, simulated p-value based on 100,000 replicates) |
| | blue-collar | 823 (93%/22%) | 61 (7%/14%) | 884 (21%) | |
| | entrepreneur | 140 (95%/4%) | 8 (5%/2%) | 148 (4%) | |
| | housemaid | 99 (90%/3%) | 11 (10%/2%) | 110 (3%) | |
| | management | 294 (91%/8%) | 30 (9%/7%) | 324 (8%) | |
| | retired | 128 (77%/3%) | 38 (23%/8%) | 166 (4%) | |
| | self-employed | 146 (92%/4%) | 13 (8%/3%) | 159 (4%) | |
| | services | 358 (91%/10%) | 35 (9%/8%) | 393 (10%) | |
| | student | 63 (77%/2%) | 19 (23%/4%) | 82 (2%) | |
| | technician | 611 (88%/17%) | 80 (12%/18%) | 691 (17%) | |
| | unemployed | 92 (83%/3%) | 19 (17%/4%) | 111 (3%) | |
| | unknown | 35 (90%/1%) | 4 (10%/1%) | 39 (1%) | |
| marital | divorced | 403 (90%/11%) | 43 (10%/10%) | 446 (11%) | p = 0.0165 (Fisher's Exact Test) |
| | married | 2257 (90%/62%) | 252 (10%/56%) | 2509 (61%) | |
| | single | 998 (87%/27%) | 155 (13%/34%) | 1153 (28%) | |
| | unknown | 10 (91%/0.3%) | 1 (9%/0.2%) | 11 (0.3%) | |
| education | basic.4y or less | 392 (91%/11%) | 38 (9%/8%) | 430 (10%) | p = 0.0011 (Pearson's Chi-squared test) |
| | basic.6y | 211 (93%/6%) | 17 (7%/4%) | 228 (6%) | |
| | basic.9y | 531 (93%/14%) | 43 (7%/10%) | 574 (14%) | |
| | high.school | 824 (89%/22%) | 97 (11%/22%) | 921 (22%) | |
| | professional.course | 470 (88%/13%) | 65 (12%/14%) | 535 (13%) | |
| | university.degree | 1099 (87%/30%) | 165 (13%/37%) | 1264 (31%) | |
| | unknown | 141 (84%/4%) | 26 (16%/6%) | 167 (4%) | |
| taxLien | no | 2913 (88%/79%) | 402 (12%/89%) | 3315 (80%) | p < 0.0001 (Pearson's Chi-squared test) |
| | unknown/yes | 755 (94%/21%) | 49 (6%/11%) | 804 (20%) | |
| mortgage | no | 1637 (89%/45%) | 202 (11%/45%) | 1839 (45%) | p = 0.7307 (Pearson's Chi-squared test) |
| | unknown | 96 (91%/3%) | 9 (9%/2%) | 105 (3%) | |
| | yes | 1935 (89%/53%) | 240 (11%/53%) | 2175 (53%) | |
| taxbill_in_phl | no | 693 (90%/19%) | 77 (10%/17%) | 770 (19%) | p = 0.3495 (Pearson's Chi-squared test) |
| | yes | 2975 (89%/81%) | 374 (11%/83%) | 3349 (81%) | |
| contact | cellular | 2277 (86%/62%) | 375 (14%/83%) | 2652 (64%) | p < 0.0001 (Pearson's Chi-squared test) |
| | telephone | 1391 (95%/38%) | 76 (5%/17%) | 1467 (36%) | |
| month | apr | 179 (83%/5%) | 36 (17%/8%) | 215 (5%) | p < 0.0001 (Fisher's Exact Test, simulated p-value based on 100,000 replicates) |
| | aug | 572 (90%/16%) | 64 (10%/14%) | 636 (15%) | |
| | dec | 10 (45%/0.3%) | 12 (55%/3%) | 22 (1%) | |
| | jul | 652 (92%/18%) | 59 (8%/13%) | 711 (17%) | |
| | jun | 462 (87%/13%) | 68 (13%/15%) | 530 (13%) | |
| | mar | 20 (42%/1%) | 28 (58%/6%) | 48 (1%) | |
| | may | 1288 (93%/35%) | 90 (7%/20%) | 1378 (33%) | |
| | nov | 403 (90%/11%) | 43 (10%/10%) | 446 (11%) | |
| | oct | 44 (64%/1%) | 25 (36%/6%) | 69 (2%) | |
| | sep | 38 (59%/1%) | 26 (41%/6%) | 64 (2%) | |
| day_of_week | fri | 685 (89%/19%) | 83 (11%/18%) | 768 (19%) | p = 0.9723 (Pearson's Chi-squared test) |
| | mon | 757 (89%/21%) | 98 (11%/22%) | 855 (21%) | |
| | thu | 764 (89%/21%) | 96 (11%/21%) | 860 (21%) | |
| | tue | 750 (89%/20%) | 91 (11%/20%) | 841 (20%) | |
| | wed | 712 (90%/19%) | 83 (10%/18%) | 795 (19%) | |
| poutcome | failure | 387 (85%/11%) | 67 (15%/15%) | 454 (11%) | p < 0.0001 (Pearson's Chi-squared test) |
| | nonexistent | 3231 (92%/88%) | 292 (8%/65%) | 3523 (86%) | |
| | success | 50 (35%/1%) | 92 (65%/20%) | 142 (3%) | |
| Total | | 3668 (89%) | 451 (11%) | 4119 (100%) | |
3.1.4 Feature Engineering
We further attempt to exploit differences in the distributions of the continuous and categorical variables in our dataset by engineering new features that highlight the nuanced ways in which they diverge.
Show the code
hmm1 <- housing_subsidy_data %>%
  pivot_longer(cols = where(is.numeric), names_to = "variable", values_to = "value") %>%
  filter(variable != "y_numeric")

ggplot(hmm1) +
  geom_density(aes(x = value, fill = y)) +
  facet_wrap(~variable, scales = "free") +
  labs(x = "Output Var", y = "Density",
       title = "Feature associations with the likelihood of entering a program",
       subtitle = "(Continuous outcomes)") +
  theme(axis.text.x = element_text(hjust = 1, angle = 45))
Show the code
hmm2 <- housing_subsidy_data %>%
  pivot_longer(cols = where(is.character), names_to = "variable", values_to = "value") %>%
  filter(variable != "y") %>%
  select(y_numeric, value, variable) %>%
  group_by(y_numeric, variable, value) %>%
  summarize(count = n())

hmm_counts <- hmm2 %>%
  group_by(y_numeric, variable) %>%
  summarize(total = sum(count))

hmm2 <- left_join(hmm2, hmm_counts) %>%
  mutate(pct = count / total * 100)

ggplot(hmm2) +
  geom_col(aes(x = value, y = pct, fill = as.factor(y_numeric)), position = "dodge") +
  facet_wrap(~variable, scales = "free") +
  labs(x = "Output Var", y = "Percent of Class",
       title = "Feature associations with the likelihood of entering a program",
       subtitle = "(Categorical outcomes)") +
  theme(axis.text.x = element_text(hjust = 1, angle = 45))
Based on these distributions, we implement some of the following approaches (this is not a comprehensive list; full engineering can be seen in the code below):
Use binary markers to indicate whether the consumer confidence index (CCI) for a given observation falls into one of the observed peaks in CCI associated with higher rates of credit enrollment
Create a binary variable to indicate the presence of a university degree
Create a binary category for months to indicate whether a given observation occurred in a month associated with a higher rate of credit enrollment
We also drop redundant variables following these transformations.
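A minimal sketch of how these engineered features might be constructed (the cut points, month groupings, and dropped columns here are illustrative assumptions based on the distributions above, not the exact transformations used in the final model):

# Illustrative engineered features (cut points and groupings are assumptions)
housing_subsidy_eng <- housing_subsidy_data %>%
  mutate(
    # binary marker for an observed peak in the consumer confidence index
    cci_peak = ifelse(cons.conf.idx > -42 & cons.conf.idx < -36, 1, 0),
    # binary indicator for holding a university degree
    univ_degree = ifelse(education == "university.degree", 1, 0),
    # binary indicator for months with higher observed enrollment rates
    high_enroll_month = ifelse(month %in% c("mar", "sep", "oct", "dec"), 1, 0)
  ) %>%
  # drop originals made redundant by the new features
  select(-education, -month)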
4 Results
4.1 Logistic Regression
Below, we use a binomial logistic regression to predict the classifications of our dependent variable, y. A logistic regression estimates, from the predictor variables, the probability that the dependent variable falls within one of two classes. By setting a probability threshold (typically 0.5), we can convert those probabilities into classifications. Our primary outcomes of interest are sensitivity, specificity, and the misclassification rate. Sensitivity (the true positive rate) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people correctly identified as having the condition) and is complementary to the false negative rate. Specificity (the true negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people correctly identified as not having the condition) and is complementary to the false positive rate. The misclassification rate is the number of false positives plus false negatives divided by the total number of observations.
Although the standard approach is to pursue balanced classification accuracy, that is, a high true positive rate and a low false positive rate, in this particular instance we seek to maximize the true positive rate above all else, given that the return on a true positive outweighs the loss on a false positive by a factor of more than four ($12,400 versus $2,850, as detailed in the cost-benefit analysis below).
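A minimal sketch of the cross-validated fit summarized below (the seed, split object names, and predictor selection are assumptions; twoClassSummary reports the cross-validated ROC, sensitivity, and specificity):

set.seed(1234)

# 65/35 train/test split, stratified on the outcome
data_split <- initial_split(housing_subsidy_data, prop = 0.65, strata = "y")
train_set  <- training(data_split)
test_set   <- testing(data_split)

# 100-fold cross-validated logistic regression
ctrl <- trainControl(method = "cv", number = 100,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

reg_cv <- train(y ~ .,
                data = train_set %>%
                  mutate(y = factor(y, levels = c("no", "yes"))) %>%
                  select(-y_numeric, -X),
                method = "glm", family = "binomial",
                metric = "ROC", trControl = ctrl)
reg_cv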
Generalized Linear Model
2677 samples
23 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (100 fold)
Summary of sample sizes: 2650, 2650, 2650, 2650, 2651, 2650, ...
Resampling results:
ROC Sens Spec
0.7506401 0.9794203 0.1566667
4.2 Prediction Accuracy
We find that our feature engineering does marginally increase the balanced accuracy of the model, as indicated by the increase in AUC. It is, comparatively speaking, a better model than the kitchen sink approach, although in absolute terms, it is only a moderately successful model. However, as discussed below, its increased sensitivity has meaningful implications for its functional utility in revenue generation and service delivery in the public sector.
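For reference, the AUC comparison can be computed directly from the held-out probabilities with pROC (a sketch; testProbs_base and testProbs_eng are assumed to be data frames holding the observed outcome and predicted probability for each test-set homeowner, as used in the cost-benefit tables below):

# Area under the ROC curve for each model on the test set
auc(testProbs_base$Outcome, testProbs_base$Probs)
auc(testProbs_eng$Outcome, testProbs_eng$Probs)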
In keeping with the goals of this project, we are not seeking balanced accuracy, but rather to maximize revenue according to our cost-benefit analysis. Per our assumptions, the credit allocates $5,000 per homeowner, which can be used toward home improvement, and each marketing contact costs $2,850. Academic researchers in Philadelphia evaluated the program and found that houses that transacted after taking the credit sold with a $10,000 premium, on average, while homes surrounding the repaired home saw an aggregate premium of $56,000, on average. Based on these assumptions, we assign the following value to each prediction outcome:
True Positive - Predicted correctly homeowner would enter credit program; allocated the marketing resources, and 25% ultimately achieved the credit. Per the formula below, we count a net revenue gain of $12,400 per true positive.
True Negative - Predicted correctly homeowner would not enter the credit program, no marketing resources were allocated, and no credit was allocated. Thus, the net revenue gain for a true negative is $0.
False Positive - Predicted incorrectly homeowner would enter the credit program; allocated marketing resources; no credit allocated. We count a net revenue loss of $2,850 per false positive.
False Negative - We predicted that a homeowner would not enter the credit program but they did. These are likely homeowners who signed up for reasons unrelated to the marketing campaign. Thus, the net revenue gain for a false negative is $0.
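One way to reconstruct the $12,400 figure from the assumptions above (a sketch of the arithmetic, with the 25% conversion rate applied to both the sale premiums and the $5,000 credit):

0.25 x ($10,000 + $56,000) - $2,850 - 0.25 x $5,000 = $16,500 - $2,850 - $1,250 = $12,400 per true positive

A false positive, by contrast, incurs only the $2,850 marketing cost, which is why it enters the analysis as a loss of $2,850.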
Because the benefit of a true positive significantly outweighs the cost of a false positive ($12,400 gained per true positive versus $2,850 lost per false positive), we tune our model to maximize sensitivity (the true positive rate) rather than balanced accuracy.
Show the code
print_cb_tab(testProbs_base, "Cost-Benefit Analysis with Base Data")
Cost-Benefit Analysis with Base Data

| Variable | Count | Revenue | Description |
|---|---|---|---|
| True_Negative | 1,247 | $0 | Predicted correctly homeowner would not enter the credit program; no marketing resources were allocated, and no credit was allocated. |
| True_Positive | 35 | $434,000 | Predicted correctly homeowner would enter credit program; allocated the marketing resources, and 25% ultimately achieved the credit. |
| False_Negative | 123 | $0 | We predicted that a homeowner would not enter the credit program, but they did. |
| False_Positive | 37 | -$105,450 | Predicted incorrectly that a homeowner would enter the credit program. |
Show the code
print_cb_tab(testProbs_eng, "Cost-Benefit Analysis with Engineered Data")
Cost-Benefit Analysis with Engineered Data

| Variable | Count | Revenue | Description |
|---|---|---|---|
| True_Negative | 1,270 | $0 | Predicted correctly homeowner would not enter the credit program; no marketing resources were allocated, and no credit was allocated. |
| True_Positive | 38 | $471,200 | Predicted correctly homeowner would enter credit program; allocated the marketing resources, and 25% ultimately achieved the credit. |
| False_Negative | 120 | $0 | We predicted that a homeowner would not enter the credit program, but they did. |
| False_Positive | 14 | -$39,900 | Predicted incorrectly that a homeowner would enter the credit program. |
Based on our test datasets, we find that our engineered dataset does return a slightly higher maximum revenue than our base dataset, at a threshold around 0.15. It is notable that the threshold is so much lower than the standard 0.5. Again, this is because we are not interested in balanced class accuracy in this use case, but rather in maximizing net revenue, which requires prioritizing sensitivity (true positives) above all else.
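A minimal sketch of the threshold sweep behind the table below (the column names in testProbs_eng and the "yes"/"no" coding of the outcome are assumptions; the per-outcome dollar values follow the cost-benefit assumptions above):

# Net revenue and credits allocated at each candidate threshold
thresholds <- seq(0.05, 0.95, by = 0.05)

revenue_by_threshold <- map_dfr(thresholds, function(t) {
  classified <- testProbs_eng %>%
    mutate(pred = ifelse(Probs >= t, "yes", "no"))
  tp <- sum(classified$Outcome == "yes" & classified$pred == "yes")
  fp <- sum(classified$Outcome == "no"  & classified$pred == "yes")
  tibble(Threshold     = t,
         total_credits = tp * 0.25 * 5000,        # 25% of true positives take the $5,000 credit
         revenue       = tp * 12400 - fp * 2850)  # per-outcome values from the analysis above
})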
Credit Counts and Revenues Across Models and Thresholds

| Threshold | Model | Total Credits Allocated | Net Revenue |
|---|---|---|---|
| 0.20 | Base Data | $98,750 | $666,100 |
| 0.50 | Base Data | $43,750 | $328,550 |
| 0.15 | Engineered Data | $110,000 | $803,350 |
| 0.50 | Engineered Data | $47,500 | $431,300 |
5 Discussion
5.1 Importance of Feature Engineering
This exercise speaks to the importance of proper feature engineering in building a model. Several features, such as the consumer price index, were rendered meaningful by examining the full distributions of the data (rather than a summary statistic like the mean) and deriving new features from them, such as binary markers of peaks in the index. Further features, such as interaction terms, could be added in more sophisticated models. The result of this feature engineering was a small but financially meaningful increase in projected net revenue relative to our base dataset, indicating that a few hours' worth of data wrangling could translate to tens of thousands of dollars in additional revenue.
5.2 Possible New Features
To further improve the model, any number of new features could be added, evaluated, and engineered. Among those that we suspect may improve the model are homeowner race, License & Inspections violations, homeowner tenure, and enrollment in other City programs. As with other features, these should be explored in order to assess correlation and the possibility of deriving additional features from them.
5.2.1 Feature Exploration to Guide Outreach
Additionally, we note that a basic understanding of correlations can be used to further guide outreach. Noting that, for example, certain peaks in unemployment rates were correlated with enrollment in the credit, it might be wise to pursue more outreach during comparable future peaks.
5.3 Outcome Prioritization
Another notable component of this exercise is the divergence of academic and public-sector priorities in model building. The default aim in building a logistic regression model is usually to maximize its balanced class accuracy. In this case, however, given our cost-benefit analysis, we found that maximizing sensitivity was most important, since the payoff for a true positive is much higher than the cost of a false positive, and true negatives and false negatives both have net returns of $0. This speaks further to the importance of domain knowledge in data analysis and model building.
5.4 Limitations of Logistic Regression
Finally, this model runs up against the limits of logistic regression. Although logit models have fewer assumptions to satisfy than standard OLS regression, for example, they have weaknesses of their own. For one thing, they typically require very large datasets: a dataset of 40,000 or even 400,000 observations would likely have yielded more accurate results than our dataset of roughly 4,000 observations. This is especially true in the presence of a severe class imbalance, as we have here. Of the 4,119 observations in our dataset, only 10.9% are instances in which the homeowner took the credit. This means we are likely dealing with a rare event, in which case we might find more success with a penalized likelihood approach such as lasso, ridge, or elastic net regression.
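As an illustration of what such a penalized approach might look like (a sketch using the glmnet package, which is not loaded above, and reusing the train_set object assumed earlier; whether it actually improves on the logistic model here would need to be evaluated):

library(glmnet)

# Elastic-net logistic regression; alpha = 0.5 mixes the lasso (alpha = 1)
# and ridge (alpha = 0) penalties, with lambda chosen by cross-validation
x_train <- model.matrix(y ~ . - 1, data = train_set %>% select(-y_numeric, -X))
y_train <- factor(train_set$y, levels = c("no", "yes"))

cv_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 0.5)

# Predicted enrollment probabilities at the cross-validated lambda
head(predict(cv_fit, newx = x_train, s = "lambda.min", type = "response"))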
6 Conclusion
We find that 1) the use of logistic regression as a predictive tool has the potential to increase revenues over the current standard operating practice of random outreach, and 2) that basic feature engineering of the dataset can be used to further increase net revenue using the predictive model. It is true that the bar is low: improving on random outreach, which is no better than a coin flip, is not hard to do. But given how commonly such practices are standard in local governments, this exercise speaks to the large return on investment that can be derived from even basic predictive modeling. The exercise furthermore indicates particular features–such as timing and economic factors–that should be used to guide outreach efforts in order to increase the response rates. Thus, the model (or really, any model with decent sensitivity) is worth putting into production, and the process of building it and improving it can also suggest ways to improve current outreach practices.