Assignment 05 - Public Policy Analytics | Jarred Randall


MOTIVATION


The Emil City Department of Housing and Community Development extends a home repair tax credit program to eligible homeowners. The program spans nearly two decades and historically, homeowners have leveraged this credit to successfully boost their property values by an average of $10,000. This has resulted in neighboring properties substantially increasing in value, totaling a premium of $56,000.

Typically, however, not all eligible homeowners request the credit. The Department of Housing and Community Development finds the conversion rate minimal - perhaps a result of randomly contacting eligible homeowners. The cost of marketing each credit is $2,850 while the department must also spend an additional $5,000 per credit, since this is the amount of the subsidy granted.

Consequently, this cost-benefit analysis intends to steer the Department of Housing and Community Development towards a more precise, intentional program. To guide allocation of funds and increase participation in the credit program, our model aims to inform productive outreach for homeowners genuinely interested in the subsidy. By leveraging our insights regarding utilization of credits, we ultimately uncover subsidy eligibility centered around two desired outcomes: cost optimization and increased participation.


EXPLORATION


We conduct our analysis through historical campaign records. These records provide a detailed view about traits of homeowners and their utilization of the repair credit. We start with the continuous features collected at the time of the campaign: age, unemployment rate, consumer price index, consumer confidence index, inflation rate, and annual home repair expenditure. Additional data included is the number of contacts during the campaign; the number of contacts prior to the campaign; and the days since last contacted for the previous program.

Interestingly, the variables with weak correlation to credit utilization are age, consumer price index, consumer confidence index, and annual home repair expenditure. We, however, note a correlation using the unemployment and inflation rates; the number of contacts during the campaign; the number of contacts prior to the campaign; and the days since last contacted for the previous program.

Homeowners who were contacted previously and recently show greatest susceptibility to the subsidy offer. Furthermore, they were more inclined to accept the offer when contacted during periods of low inflation, perhaps indicating a desire to invest in repairs during periods of economic stability. Further, when unemployment rates are lower, homeowners are also inclined to accept the offer. Stability and disposable income could be key to influencing the likelihood of offer acceptance - highlighting the importance of economic conditions in homeowner decision-making.

housingsubsidy %>%
  dplyr::select(y, previous, age, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs, unemploy_rate,  previous, pdays) %>%
  gather(Variable, value, -y) %>%
  ggplot(aes(y, value, fill=y)) + 
  geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
  facet_wrap(~Variable, scales = "free", ncol = 2) +
  scale_fill_manual(values = paletteV4) +
  labs(x="Accepted Credit", y="Value", 
       title = "Feature Associations With Likelihood Of Accepting Subsidy",
       subtitle = "Continuous Variable Outcomes") +
  theme_bw() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 14.5, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "#C8C8C8", fill=NA, linewidth =0.8))

The majority of homeowners with mortgages do not take the credit. Similarly, a significant number of homeowners with tax bill in Philadelphia chose not to take advantage of the subsidies. This might suggest a negative correlation between these variables and subsidy utilization. However, it’s important to note that over 75% of homeowners in the dataset declined the credit. Further, individuals with a lien on their home are also more likely to make use of the credit in contrast to those without a lien on their property.

housingsubsidy %>%
  dplyr::select(y, mortgage, taxbill_in_phl, taxLien) %>%
  gather(Variable, value, -y) %>%
  count(Variable, value, y) %>%
  ggplot(aes(y, n, fill=y)) + 
  geom_bar(position = "dodge", stat = "identity") +  # Specifying stat as "identity" to use raw counts
  facet_wrap(~Variable, scales = "free") +
  scale_fill_manual(values = paletteV4) +
  labs(x="Accepted Credit", y="Value", 
       title = "Feature Associations With Likelihood Of Accepting Subsidy",
       subtitle = "Response: Yes Or No") +
  theme_bw() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 14.5, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "#C8C8C8", fill=NA, linewidth =0.8))

Lastly, we analyze the categorical variables: occupation, marital status, education attainment, method of contact, month of last contact, day of week last contacted, and previous marketing outcomes. The majority of the categorical variables demonstrate no influence on the decision of homeowners to either accept or reject the subsidy. Nonetheless, month of last contact; marital status; occupation; and method of contact prove moderately helpful finding correlation with credit utilization.

We notice that individuals employed in administrative roles along with married individuals are more inclined to utilize the tax credit. Furthermore, the timing of contact seems to have a significant impact on the credit acceptance with homeowners contacted in December, March, October, and September displaying a higher likelihood of accepting the subsidy. Moreover, homeowners who were contacted via cell phone, as opposed to a landline/ traditional telephone, were also more likely to accept the offer.

housingsubsidy %>%
    dplyr::select(y, contact, job, marital, education, month, day_of_week, poutcome) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, scales="free", ncol = 2) +
        scale_fill_manual(values = paletteV4) +
        labs(x="Accepted Credit", y="Count",
             title = "Feature Associations With Likelihood Of Accepting Subsidy",
             subtitle = "Categorical Variable Outcomes",
             fill = "Accepted Credit") +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1),
              plot.subtitle = element_text(size = 12,face = "italic"),
              plot.title = element_text(size = 14.5, face = "bold"),
              panel.background = element_blank(),
              panel.border = element_rect(colour = "#C8C8C8", fill=NA, linewidth =0.8))


FEATURE ENGINEERING


Leveraging our insights regarding homeowner characteristics, we construct new features to boost our model’s predictive performance. The new category groupings employed to improve the goodness of fit in our model are: season, age, education, contact, job, inflation, and unemployment. With season, we split the months into four quarters following the fiscal calendar year. We also reclassify age into four categories, job into four levels of income, and education into four levels. Finally, we create categories for the time since the last marketing contact and whether an individual has been contacted; and the inflation and employment rates at the time of the campaign.

# Financial Calendar, Fiscal Quarters (Q1, Q2, Q3, Q4)

housingsubsidy <-
  housingsubsidy %>%
  mutate(Season = case_when(
    month == "jan" |month == "feb" | month == "mar" ~ "Quarter 01",
    month == "apr" |month == "may" | month == "jun" ~ "Quarter 02",
    month == "jul" |month == "aug" | month == "sep" ~ "Quarter 03",
    month == "oct" |month == "nov" | month == "dec" ~ "Quarter 04"))

# Age By Generation, Split Into Four Categories

housingsubsidy <-
  housingsubsidy %>%
  mutate(Age = case_when(
    age >= 18 & age < 22  ~ "Generation Z",
    age >= 23 & age < 38  ~ "Millenials",
    age >= 39 & age < 54  ~ "Generation X",
    TRUE ~ "Baby Boomers"))

# Education Level, Split Into Four Categories

housingsubsidy <-
  housingsubsidy %>%
  mutate(Education = case_when(
    education == "basic.9y" |education == "basic.6y" | education == "basic.4y" ~ "K Through Middle",
    education == "high.school"  ~ "High School Level",
    education == "university.degree" |education == "professional.course"  ~ "Higher Education",
    education == "unknown" |education == "illiterate"  ~ "Low Or Uneducated"))

# Previous Project Contact, Split Into Five Categories

housingsubsidy <-
  housingsubsidy %>%
  mutate(Contact = case_when(pdays == 999 ~ "No Contact",
                             pdays < 7 ~ "One Week",
                             pdays >= 7 & pdays < 21 ~ "Two Weeks",
                             pdays >= 22 & pdays < 29 ~ "Three Weeks",
                             pdays >= 30 ~ "More Than Three Weeks"))

# Jobs, Split Into Four Categories

housingsubsidy <- housingsubsidy %>%
  mutate(
    Jobs_Grouped = case_when(
      job %in% c("blue-collar", "technician", "services", "housemaid") ~ "Working Class",
      job %in% c("admin.", "management") ~ "White Collar",
      job %in% c("student", "retired", "unemployed") ~ "Not Working",
      job %in% c("entrepreneur", "self-employed") ~ "Self-employed",
      TRUE ~ "Unknown"))

# Inflation Level, Split Into Two Categories: High And Low

housingsubsidy <-
  housingsubsidy %>%
  mutate(
    inflation_level = case_when(
    inflation_rate > 4  ~ "high",
    inflation_rate < 4  ~ "low"))

# Employment Level, Split Into Two Categories: High And Low

housingsubsidy <-
  housingsubsidy %>%
  mutate(
    unemployment_level = case_when(
    unemploy_rate > 0.5  ~ "high",
    unemploy_rate < 0.5  ~ "low"))

housingsubsidy %>%
  dplyr::select(y, inflation_level, unemployment_level, Jobs_Grouped, Contact, Age, Season) %>%
  gather(Variable, value, -y) %>%
  count(Variable, value, y) %>%
  ggplot(., aes(value, n, fill = y)) +   
  geom_bar(position = "dodge", stat="identity") +
  facet_wrap(~Variable, scales="free") +
  scale_fill_manual(values = paletteV4) +
  labs(title = "Feature Associations With Likelihood Of Accepting Subsidy",
       subtitle = "Feature Engineering",
       x = "Accepted Credit",
       y = "Count",
       fill = "Accepted Credit") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
              plot.subtitle = element_text(size = 12,face = "italic"),
              plot.title = element_text(size = 14.5, face = "bold"),
              panel.background = element_blank(),
              panel.border = element_rect(colour = "#C8C8C8", fill=NA, linewidth =0.8))

# Split Housing Subsidy Data Into 65:35 Training Dataset and Test Dataset

set.seed(3456)
trainIndex <- createDataPartition(y = paste(housingsubsidy$job, housingsubsidy$education,housingsubsidy$previous, housingsubsidy$y_numeric), p = .65,
                                  list = FALSE,
                                  times = 1)
subsidytrain <- housingsubsidy[ trainIndex,]
subsidytest  <- housingsubsidy[-trainIndex,]


MODEL EVALUATION

To confirm the homeowner traits most closely linked to the acceptance or refusal of subsidy credits, we created a logistic regression model. This analytical approach details the key factors shaping homeowners’ decisions regarding housing credits, thereby guiding strategies to enhance participation in the program. Our study encompassed two tests: one leveraging the comprehensive historical campaign data known as the “kitchen sink model” and another incorporating our engineered features called “feature engineered”. This comprehensive investigation aims to identify the most effective means of engaging prospective program participants. Additionally, we assessed the accuracy of our models to ensure their reliability and predictive power.


# Split Housing Subsidy Data Into 65:35 Training Dataset and Test Dataset

set.seed(3456)
trainIndex <- createDataPartition(y = paste(housingsubsidy$job, housingsubsidy$education,housingsubsidy$previous, housingsubsidy$y_numeric), p = .65,
                                  list = FALSE,
                                  times = 1)
subsidytrain <- housingsubsidy[ trainIndex,]
subsidytest  <- housingsubsidy[-trainIndex,]

Kitchen Sink Regression

The McFadden score of 0.24 indicates that our logistic regression analysis accounts for 24% of the variance in the homeowners’ decisions regarding housing credits, as represented by the acceptance or refusal of subsidy credits. This finding highlights the moderate explanatory power of the model, underscoring the importance of considering additional variables or refined approaches to enhance its predictive capability.

# Kitchen Sink: All Original Data While Removing Engineered Features

kitchensink <- glm(y_numeric ~ .,
                  data=subsidytrain %>% 
                  dplyr::select(-X, -y,-Season, -Age, -Education,-inflation_level,-unemploy_rate, -Jobs_Grouped),
                  family="binomial" (link="logit"))

summary(kitchensink)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = subsidytrain %>% dplyr::select(-X, -y, -Season, -Age, 
##         -Education, -inflation_level, -unemploy_rate, -Jobs_Grouped))
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   2.010e+02  1.210e+02   1.662 0.096574 .  
## age                           1.922e-02  8.172e-03   2.351 0.018710 *  
## jobblue-collar               -3.478e-01  2.614e-01  -1.331 0.183344    
## jobentrepreneur              -4.371e-01  4.416e-01  -0.990 0.322231    
## jobhousemaid                 -1.717e-01  4.511e-01  -0.381 0.703506    
## jobmanagement                -5.268e-01  2.948e-01  -1.787 0.073917 .  
## jobretired                   -4.107e-01  3.607e-01  -1.139 0.254800    
## jobself-employed             -4.172e-01  3.991e-01  -1.045 0.295798    
## jobservices                  -1.314e-01  2.768e-01  -0.475 0.635064    
## jobstudent                    9.007e-02  3.896e-01   0.231 0.817149    
## jobtechnician                -4.623e-02  2.304e-01  -0.201 0.840981    
## jobunemployed                 1.564e-01  3.743e-01   0.418 0.676127    
## jobunknown                   -5.015e-01  6.701e-01  -0.748 0.454229    
## maritalmarried                2.565e-01  2.418e-01   1.061 0.288875    
## maritalsingle                 3.281e-01  2.775e-01   1.182 0.237187    
## maritalunknown               -1.264e+01  4.960e+02  -0.025 0.979663    
## educationbasic.6y             1.644e-01  3.868e-01   0.425 0.670881    
## educationbasic.9y             9.798e-02  3.121e-01   0.314 0.753543    
## educationhigh.school          6.985e-03  3.007e-01   0.023 0.981470    
## educationilliterate          -1.383e+01  1.455e+03  -0.010 0.992418    
## educationprofessional.course  1.737e-01  3.285e-01   0.529 0.596846    
## educationuniversity.degree    7.686e-02  3.018e-01   0.255 0.799008    
## educationunknown              2.192e-01  3.826e-01   0.573 0.566775    
## taxLienunknown                1.037e-01  2.040e-01   0.508 0.611143    
## taxLienyes                   -1.230e+01  1.455e+03  -0.008 0.993259    
## mortgageunknown              -6.580e-01  6.041e-01  -1.089 0.276061    
## mortgageyes                  -4.255e-02  1.367e-01  -0.311 0.755692    
## taxbill_in_phlyes             7.577e-02  1.870e-01   0.405 0.685350    
## contacttelephone             -6.954e-01  2.547e-01  -2.730 0.006329 ** 
## monthaug                     -2.882e-01  3.710e-01  -0.777 0.437247    
## monthdec                      1.705e-01  6.586e-01   0.259 0.795727    
## monthjul                      2.633e-01  3.493e-01   0.754 0.451022    
## monthjun                      7.694e-01  3.295e-01   2.335 0.019527 *  
## monthmar                      1.635e+00  4.928e-01   3.318 0.000906 ***
## monthmay                     -3.816e-01  2.820e-01  -1.353 0.176059    
## monthnov                     -2.806e-01  4.930e-01  -0.569 0.569311    
## monthoct                     -3.355e-01  5.295e-01  -0.634 0.526324    
## monthsep                     -3.823e-01  5.345e-01  -0.715 0.474475    
## day_of_weekmon               -8.050e-02  2.155e-01  -0.374 0.708682    
## day_of_weekthu                1.068e-02  2.173e-01   0.049 0.960809    
## day_of_weektue                3.408e-02  2.168e-01   0.157 0.875092    
## day_of_weekwed                2.042e-01  2.222e-01   0.919 0.358107    
## campaign                     -9.270e-02  4.266e-02  -2.173 0.029794 *  
## pdays                        -1.582e-01  9.245e-02  -1.711 0.087084 .  
## previous                      4.937e-04  1.760e-01   0.003 0.997762    
## poutcomenonexistent           4.839e-01  2.987e-01   1.620 0.105260    
## poutcomesuccess               9.183e-01  7.075e-01   1.298 0.194283    
## cons.price.idx                9.853e-02  4.536e-01   0.217 0.828044    
## cons.conf.idx                 3.152e-02  2.849e-02   1.106 0.268615    
## inflation_rate               -6.160e-02  4.513e-01  -0.136 0.891434    
## spent_on_repairs             -1.047e-02  8.159e-03  -1.283 0.199567    
## ContactOne Week              -1.564e+02  9.185e+01  -1.703 0.088640 .  
## ContactTwo Weeks             -1.560e+02  9.125e+01  -1.709 0.087428 .  
## unemployment_levellow        -1.439e-01  6.593e-01  -0.218 0.827183    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2089.8  on 2812  degrees of freedom
## Residual deviance: 1584.2  on 2759  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 1692.2
## 
## Number of Fisher Scoring iterations: 14
kable(pR2(kitchensink),
       caption = "Optimal Threshold") %>% kable_styling()
## fitting null model for pseudo-r2
Optimal Threshold
x
llh -792.1034505
llhNull -1044.9218220
G2 505.6367429
McFadden 0.2419496
r2ML 0.1645209
r2CU 0.3138043
testProbs <- data.frame(Outcome = as.factor(subsidytest$y_numeric),
                        Probs = predict(kitchensink, subsidytest, type= "response"))

ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = paletteV4) +
  labs(x = "Accepted Subsidy", y = "Density of Probabilities",
       title = "Distribution of Predicted Probabilities by Observed Outcome") +
  theme_bw()+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 14.5, face = "bold"))

Kitchen Sink Confusion Matrix

The confusion matrix reveals the model’s predictive performance in categorizing instances into two classes, “0” and “1,” representing the refusal and acceptance of subsidy credits, respectively.

testProbskit <- testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))

caret::confusionMatrix(testProbskit$predOutcome, testProbskit$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1181   88
##          1   18   18
##                                          
##                Accuracy : 0.9188         
##                  95% CI : (0.9026, 0.933)
##     No Information Rate : 0.9188         
##     P-Value [Acc > NIR] : 0.5258         
##                                          
##                   Kappa : 0.2215         
##                                          
##  Mcnemar's Test P-Value : 2.058e-11      
##                                          
##             Sensitivity : 0.16981        
##             Specificity : 0.98499        
##          Pos Pred Value : 0.50000        
##          Neg Pred Value : 0.93065        
##              Prevalence : 0.08123        
##          Detection Rate : 0.01379        
##    Detection Prevalence : 0.02759        
##       Balanced Accuracy : 0.57740        
##                                          
##        'Positive' Class : 1              
## 

The model demonstrates a high specificity (0.98) in correctly identifying homeowners who are unlikely to accept the subsidy. However, its sensitivity (0.17) is relatively low, indicating a need for improvement in recognizing potential participants. Overall accuracy stands at 0.92, suggesting a generally reliable performance in distinguishing acceptance and refusal cases within the housing subsidy dataset. Further adjustments to enhance sensitivity could provide a more comprehensive understanding of the factors influencing homeowner decisions regarding the subsidy.

Engineered Features Regression

In the case of the feature engineered model, the McFadden score has improved to 0.25, suggesting that our engineered model explains 25% of the variability in homeowners’ decisions regarding housing credits, specifically the acceptance or refusal of subsidy credits.

# Engineered Model: Manipulated Features and Extracting Original Data

regressionengineered <- glm(y_numeric ~ .,
                  data=subsidytrain %>% dplyr::select(-X, -y),
                  family="binomial" (link="logit")) %>%
                  na.omit() 
                  
summary(regressionengineered)
## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = subsidytrain %>% dplyr::select(-X, -y))
## 
## Coefficients: (10 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   7.466e+01  1.607e+02   0.464 0.642308    
## age                           1.088e-02  1.279e-02   0.850 0.395189    
## jobblue-collar               -3.523e-01  2.624e-01  -1.343 0.179327    
## jobentrepreneur              -4.465e-01  4.422e-01  -1.010 0.312557    
## jobhousemaid                 -1.622e-01  4.497e-01  -0.361 0.718330    
## jobmanagement                -5.676e-01  2.967e-01  -1.913 0.055761 .  
## jobretired                   -3.982e-01  3.800e-01  -1.048 0.294664    
## jobself-employed             -4.017e-01  3.999e-01  -1.004 0.315222    
## jobservices                  -1.302e-01  2.782e-01  -0.468 0.639721    
## jobstudent                    1.407e-01  4.158e-01   0.338 0.735042    
## jobtechnician                -2.831e-02  2.315e-01  -0.122 0.902669    
## jobunemployed                 1.108e-01  3.781e-01   0.293 0.769484    
## jobunknown                   -5.143e-01  6.708e-01  -0.767 0.443231    
## maritalmarried                2.567e-01  2.425e-01   1.058 0.289883    
## maritalsingle                 3.255e-01  2.790e-01   1.167 0.243394    
## maritalunknown               -1.264e+01  4.959e+02  -0.025 0.979664    
## educationbasic.6y             1.413e-01  3.881e-01   0.364 0.715733    
## educationbasic.9y             7.772e-02  3.117e-01   0.249 0.803082    
## educationhigh.school         -1.673e-02  3.006e-01  -0.056 0.955621    
## educationilliterate          -1.416e+01  1.455e+03  -0.010 0.992240    
## educationprofessional.course  1.350e-01  3.281e-01   0.411 0.680777    
## educationuniversity.degree    5.159e-02  3.010e-01   0.171 0.863937    
## educationunknown              1.765e-01  3.815e-01   0.463 0.643545    
## taxLienunknown                1.075e-01  2.038e-01   0.528 0.597799    
## taxLienyes                   -1.218e+01  1.455e+03  -0.008 0.993323    
## mortgageunknown              -6.475e-01  6.045e-01  -1.071 0.284163    
## mortgageyes                  -1.945e-02  1.374e-01  -0.142 0.887412    
## taxbill_in_phlyes             6.161e-02  1.875e-01   0.329 0.742399    
## contacttelephone             -8.666e-01  2.784e-01  -3.112 0.001856 ** 
## monthaug                      8.984e-02  4.198e-01   0.214 0.830525    
## monthdec                     -9.637e-02  7.451e-01  -0.129 0.897094    
## monthjul                     -1.220e-01  4.010e-01  -0.304 0.760949    
## monthjun                      7.418e-02  4.517e-01   0.164 0.869557    
## monthmar                      1.958e+00  5.400e-01   3.626 0.000288 ***
## monthmay                     -3.740e-01  3.133e-01  -1.194 0.232656    
## monthnov                     -7.872e-01  5.925e-01  -1.329 0.183989    
## monthoct                     -3.911e-01  5.763e-01  -0.679 0.497373    
## monthsep                      1.920e-01  6.224e-01   0.308 0.757750    
## day_of_weekmon               -4.934e-02  2.163e-01  -0.228 0.819561    
## day_of_weekthu                2.046e-02  2.183e-01   0.094 0.925328    
## day_of_weektue                5.717e-02  2.175e-01   0.263 0.792667    
## day_of_weekwed                2.554e-01  2.237e-01   1.142 0.253461    
## campaign                     -8.943e-02  4.257e-02  -2.101 0.035652 *  
## pdays                        -1.510e-01  9.237e-02  -1.634 0.102198    
## previous                     -2.677e-02  1.733e-01  -0.154 0.877253    
## poutcomenonexistent           4.954e-01  2.984e-01   1.660 0.096961 .  
## poutcomesuccess               8.941e-01  7.078e-01   1.263 0.206532    
## unemploy_rate                -1.209e+00  5.228e-01  -2.313 0.020707 *  
## cons.price.idx                1.029e+00  8.631e-01   1.193 0.233005    
## cons.conf.idx                -5.203e-02  6.091e-02  -0.854 0.393005    
## inflation_rate               -1.394e+00  9.481e-01  -1.470 0.141440    
## spent_on_repairs             -3.323e-03  1.052e-02  -0.316 0.752100    
## SeasonQuarter 02                     NA         NA      NA       NA    
## SeasonQuarter 03                     NA         NA      NA       NA    
## SeasonQuarter 04                     NA         NA      NA       NA    
## AgeGeneration X              -8.347e-02  2.300e-01  -0.363 0.716700    
## AgeGeneration Z              -7.278e-01  8.844e-01  -0.823 0.410510    
## AgeMillenials                -2.341e-01  3.293e-01  -0.711 0.477144    
## EducationHigher Education            NA         NA      NA       NA    
## EducationK Through Middle            NA         NA      NA       NA    
## EducationLow Or Uneducated           NA         NA      NA       NA    
## ContactOne Week              -1.492e+02  9.176e+01  -1.626 0.103967    
## ContactTwo Weeks             -1.487e+02  9.117e+01  -1.631 0.102788    
## Jobs_GroupedSelf-employed            NA         NA      NA       NA    
## Jobs_GroupedUnknown                  NA         NA      NA       NA    
## Jobs_GroupedWhite Collar             NA         NA      NA       NA    
## Jobs_GroupedWorking Class            NA         NA      NA       NA    
## inflation_levellow           -5.795e+00  3.492e+00  -1.659 0.097086 .  
## unemployment_levellow        -2.005e+00  1.246e+00  -1.608 0.107790    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2089.8  on 2812  degrees of freedom
## Residual deviance: 1577.1  on 2754  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 1695.1
## 
## Number of Fisher Scoring iterations: 14
kable(pR2(regressionengineered),
       caption = "Optimal Threshold") %>% kable_styling()
## fitting null model for pseudo-r2
Optimal Threshold
x
llh -788.5384813
llhNull -1044.9218220
G2 512.7666813
McFadden 0.2453613
r2ML 0.1666359
r2CU 0.3178383
testProbsfeature <- data.frame(Outcome = as.factor(subsidytest$y_numeric),
                        Probs = predict(regressionengineered, subsidytest, type= "response"))

ggplot(testProbsfeature, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = paletteV4) +
  labs(x = "Accepted Subsidy", y = "Density of Probabilities",
       title = "Distribution of Predicted Probabilities by Observed Outcome") +
  theme_bw()+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 14.5, face = "bold"))

Engineered Features Confusion Matrix

testProbsfeature_conf <- 
  testProbsfeature %>%
  mutate(predOutcome  = as.factor(ifelse(testProbsfeature$Probs >  0.5 , 1, 0)))

caret::confusionMatrix(testProbsfeature_conf$predOutcome, testProbsfeature_conf$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1177   88
##          1   22   18
##                                           
##                Accuracy : 0.9157          
##                  95% CI : (0.8993, 0.9302)
##     No Information Rate : 0.9188          
##     P-Value [Acc > NIR] : 0.6798          
##                                           
##                   Kappa : 0.2115          
##                                           
##  Mcnemar's Test P-Value : 5.736e-10       
##                                           
##             Sensitivity : 0.16981         
##             Specificity : 0.98165         
##          Pos Pred Value : 0.45000         
##          Neg Pred Value : 0.93043         
##              Prevalence : 0.08123         
##          Detection Rate : 0.01379         
##    Detection Prevalence : 0.03065         
##       Balanced Accuracy : 0.57573         
##                                           
##        'Positive' Class : 1               
## 

The model’s specificity of 0.98 signifies its strong capability in correctly identifying individuals who are unlikely to accept the housing subsidy. Conversely, its sensitivity of 0.17 indicates the need for improved recognition of potential participants. The overall accuracy of 0.92 demonstrates the model’s consistent performance in distinguishing cases of acceptance and refusal within the housing subsidy dataset. Further refinements to enhance the model’s sensitivity could yield a more comprehensive understanding of the underlying factors influencing homeowner decisions regarding the subsidy.


ROC CURVE


The ROC curve for the kitchen sink and feature engineered models show a reliable ability to distinguish between positive and negative cases, as evident from the AUC score of 0.73. This high AUC value suggests that the feature engineered model performs notably better than random guessing in identifying positive cases (e.g., homeowners likely to accept the subsidy) and negative cases (e.g., homeowners unlikely to accept the subsidy).

ggplot(testProbsfeature_conf, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#949494") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = "#E00000") +
  labs(title = "ROC Curve - Kitchen Sink Model") +
  theme_bw()+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 14.5, face = "bold"))

kbl(auc(testProbsfeature_conf$Outcome, testProbsfeature_conf$Probs), caption = "Optimal Threshold") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Optimal Threshold
x
0.7262656
ggplot(testProbskit, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#E00000") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Feature Engineered Model") +
  theme_bw()+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 14.5, face = "bold"))

kbl(auc(testProbskit$Outcome, testProbskit$Probs), caption = "Optimal Threshold") %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")
Optimal Threshold
x
0.7318205


CROSS VALIDATION


The kitchen sink model’s strong performance in differentiating households likely to accept or reject the housing subsidy. With consistently high ROC score of 0.77 and specificity and sensitivity displaying tight distributions, the model exhibits reliable predicting capabilities. Moreover, its stable performance across various cross-validation folds suggests robust generalizability and an avoidance of overfitting to the training data.

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit_kit <- train(y ~ .,
                  data=housingsubsidy %>% 
                    dplyr::select(-X, -y_numeric,-Season, -Age, -Education, -Contact, -inflation_level,-unemployment_level, -Jobs_Grouped), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFit_kit
## Generalized Linear Model 
## 
## 4119 samples
##   19 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4077, 4078, 4077, 4078, 4077, 4077, ... 
## Resampling results:
## 
##   ROC        Sens       Spec 
##   0.7612564  0.9814264  0.215
dplyr::select(cvFit_kit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit_kit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#E00000") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#000000", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="Kitchen Sink CV Goodness of Fit Metrics",
         subtitle = "Across-Fold Mean Represented as Dotted Lines") +
    theme_bw() +
    theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 14.5, face = "bold"))

The feature engineered model also performs strong in differentiating households likely to accept or reject the housing subsidy. With consistently high ROC scores at 0.77 and specificity and sensitivity displaying tight distributions, the model exhibits reliable predicting capabilities. Moreover, its stable performance across various cross-validation folds again suggests generalizability and avoidance of overfitting.

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit_feat <- train(y ~ .,
                  data=housingsubsidy %>% 
                    na.omit() %>%
                    dplyr::select(-X, -y_numeric, -month, -age, -marital), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFit_feat
## Generalized Linear Model 
## 
## 4118 samples
##   23 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4076, 4077, 4076, 4076, 4076, 4077, ... 
## Resampling results:
## 
##   ROC        Sens       Spec  
##   0.7662237  0.9809234  0.2225
dplyr::select(cvFit_feat$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit_feat$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#E00000") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#000000", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="Feature Engineered CV Goodness of Fit Metrics",
         subtitle = "Across-Fold Mean Represented as Dotted Lines") +
    theme_bw() +
    theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 14.5, face = "bold"))


COST BENEFIT ANALYSIS


To conclude, we conduct a comprehensive cost-benefit analysis aimed at optimizing resource allocation and strategically guiding our subsidy campaign. Ultimately, our objective is establishing a threshold for categorizing individuals as probable candidates of accepting the tax credit. In this process, we approach the calculations from the perspective of the Department of Housing and Community Development, ensuring a delicate balance between cost savings and enhancing program participation. This approach will help us make informed decisions and maximize the impact of our tax credit initiative.


Our four assumptions when creating the cost-benefit analysis include:

  • 1. The Premium Per House That Accepts The Subsidy is $10,000.

  • 2. There Is A Marketing Cost Of $2,850 For Each Contact.

  • 3. A $5,000 Subsidy Is Awarded Per House That Accepts.

  • 4. The Percentage Of Homeowners Who Accept The Subsidy Is 25%.


To calculate the cost-benefit analysis, we use the following equations and methods:

True Negative: Predicted correctly the homeowner would not take the credit; no marketing resources were allocated; and no credit was allocated. Therefore, the equation would be (Count * 0).

True Positive: Predicted correctly the homeowner would take the credit; allocated the marketing resources; and 25% accepted the credit.” Thus, we calculate using ((Count * .25) * 58150).

False Negative: “Predicted the homeowner would not take the credit but they did; suggesting they signed up for reasons unrelated to the marketing campaign.” Our equation is (Count * 0).

False Positive: Predicted incorrectly the homeowner would take the credit; allocated marketing resources; no credit allocated.” Therefore, we would use (Count * -2850).

#Creating A Cost Benefit Table

cost_benefit_table <-
   testProbsfeature_conf %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
      gather(Variable, Count) %>%
      mutate(Revenue =
                case_when(Variable == "True_Negative"  ~ Count * 0,  
                         Variable == "True_Positive"  ~ ((Count * .25) * 58150),  
                         Variable == "False_Negative" ~ Count * 0,
                         Variable == "False_Positive" ~ (Count * -2850))) %>%
      bind_cols(data.frame(Description = c(
               "Predicted correctly homeowner would not take the credit; no marketing resources were allocated; and no credit was allocated.",
               "Predicted correctly homeowner would take the credit; allocated the marketing resources; and 25% accepted the credit.",
               "Predicted that a homeowner would not take the credit but they did; signed up for reasons unrelated to the marketing campaign.",
               "Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated.")))

kable(cost_benefit_table) %>% 
  kable_styling(font_size = 12, full_width = F,  
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general_title = "\n",
           general = "")
Variable Count Revenue Description
True_Negative 1177 0 Predicted correctly homeowner would not take the credit; no marketing resources were allocated; and no credit was allocated.
True_Positive 18 261675 Predicted correctly homeowner would take the credit; allocated the marketing resources; and 25% accepted the credit.
False_Negative 88 0 Predicted that a homeowner would not take the credit but they did; signed up for reasons unrelated to the marketing campaign.
False_Positive 22 -62700 Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated.

Even with the incorporation of our algorithm to enhance outreach, the Department of Housing and Community Development (HCD) incurs costs that exceed the direct revenue it generates. This outreach initiative contacts a significant number of homeowners, resulting in substantial expenses. While some homeowners accept the subsidy, leading to the distribution of credits, the overall financial outcome remains negative. It’s crucial to highlight that HCD’s mission is primarily focused on public welfare, where financial cost-benefit represents just one aspect of its goals. This model does not consider positive externalities, such as increased property values for neighboring homes and improved quality of life for homeowners. Moreover, employing this algorithm significantly reduces costs for HCD in comparison to reaching out to all eligible homeowners without optimization. Taking a comprehensive view of the cost-benefit analysis, the cost of not implementing a model or algorithm becomes significant. Financial aspects are complemented by a range of societal benefits. The HCD would want to prioritize not only profit maximization but also the broader welfare of the community.

The decision-making process should evaluate both business and public interest perspectives. From a business standpoint, a housing subsidy or credit program offers numerous advantages. Firstly, it can stimulate demand in the real estate market, boosting home sales and construction. This increased activity benefits developers and builders while contributing to overall economic growth and job creation in the construction sectors. Additionally, these programs have the potential to enhance property values, promote neighborhood revitalization, and create vibrant communities. This, in turn, can positively impact local businesses by increasing foot traffic and expanding their customer base. Overall, a housing subsidy program can yield both direct and indirect economic benefits, making it a valuable consideration from a business perspective.

When examining public interest, we recognize the positive impact of our credit program on property values and community well-being. These holistic considerations play a crucial role in the comprehensive evaluation of each credit’s direct and indirect contributions, ultimately influencing our optimization strategy. These factors also have a profound influence on variable engineering and threshold selection in our model. Notably, the cost associated with false negatives significantly exceeds that of false positives. While a revenue of $0 may indicate a financial break-even for the department, it’s evident that true positives and false positives incur substantial costs. Finding the right balance between financial sustainability and societal benefit is the critical challenge.

iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
  
  this_prediction <-
      testProbsfeature_conf %>%
      mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
     gather(Variable, Count) %>%
     mutate(Revenue =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive", (-2850 + 0.25 *(10000 + 56000 -5000)) * Count,
               ifelse(Variable == "False_Negative", Count * 0,
               ifelse(Variable == "False_Positive", (-2850) * Count, 0)))),
            Threshold = x)
  
  all_prediction <- rbind(all_prediction, this_prediction)
  x <- x + .01
  }
return(all_prediction)
}
whichThreshold <- iterateThresholds(testProbsfeature_conf)

whichThreshold_revenue <- 
whichThreshold %>% 
    group_by(Threshold) %>% 
    summarize(Revenue = sum(Revenue))

whichThreshold %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = paletteV1[c(5, 1:3)]) +    
  labs(title = "Revenue by Confusion Matrix Type and Threshold",
       y = "Revenue") +
  guides(colour = guide_legend(title = "")) +
  theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 14.5, face = "bold"))

The graph above illustrates the correlation between revenue and the threshold of the home repair subsidy program. In general, as the threshold increases, revenue would decrease as more households qualify for assistance. Thus, leading to an increased allocation of funds by the Department of Housing and Community Development. However, the revenue’s sensitivity to the type of confusion matrix is noteworthy. False positives, where marketing funding is granted without no credit utilization, result in a more gradual decline in revenue. This is in contrast to the sharp decline associated with false negatives. The choice of an optimal threshold hinges on the program’s objectives with a lower threshold maximizing the number of subsidy recipients and a higher one minimizing program costs. A more in-depth analysis reveals that false negatives can result in no repairs with no impact on neighboring property values, whereas true positives benefit both households and the HCD by offering subsidy assistance and boosting property values in the community.

In both the following graphs, the vertical line marks the optimal threshold, signifying the point where the optimal balance between revenue and allocation is achieved. Higher threshold values would lead to decreased revenue and allocation of credits. The optimal threshold value in both graphs tends to hover around 0.50. We observe a clear trend of revenue stabilizing consistently at the threshold value of 0.21. This finding highlights the strategy of concentrating credit allocation within the 0.21 to 1 threshold range for optimal departmental budget optimization. The trend’s consistency is reflected in the relationship between the threshold and the total count of credits. We determine the total count of credits is calculated as 0.25 * True Positives + False Negatives - helping one note that as the threshold value increases, the total count of credits continues to rise.

whichThreshold_revenue <- 
  whichThreshold %>% 
    mutate(TookCredit = ifelse(Variable == "True_Positive", (Count * .25),
                         ifelse(Variable == "False_Negative", Count, 0))) %>%
  group_by(Threshold) %>% 
    summarize(Total_Revenue = sum(Revenue),
              Total_Count_Of_Credits = sum(TookCredit))
grid.arrange(
  ncol = 1,
  ggplot(whichThreshold_revenue) +
    geom_line(aes(x = Threshold, y = Total_Revenue, color = "red")) +  
    geom_vline(xintercept = pull(arrange(whichThreshold_revenue, -Total_Revenue)[1, 1]), color = "Black") +
    labs(title = "Figure Number: Total Revenues By Threshold",
         subtitle = "Vertical Line Denotes Optimal Threshold") +
    scale_color_manual(values = c("red" = "red")) +
    theme_bw() +
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 45, hjust = 1),
          plot.subtitle = element_text(size = 12, face = "italic"),
          plot.title = element_text(size = 14.5, face = "bold")),

  ggplot(whichThreshold_revenue) +
    geom_line(aes(x = Threshold, y = Total_Count_Of_Credits, color = "red")) +  
    geom_vline(xintercept = pull(arrange(whichThreshold_revenue, -Total_Count_Of_Credits)[1, 1]), color = "Black")+
    labs(title = "Figure Number: Total Count of Credits By Threshold",
         subtitle = "Vertical Line Denotes Optimal Threshold") +
    scale_color_manual(values = c("red" = "red")) +
    theme_bw() + 
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 45, hjust = 1),
          plot.subtitle = element_text(size = 12, face = "italic"),
          plot.title = element_text(size = 14.5, face = "bold")))

The table provides a comparative analysis between the optimal threshold of 0.21 and the 0.50 threshold, emphasizing substantial variations in outcomes. Opting for the optimal threshold results in the highest revenue with the HCD awarding 71.5 repair credits, equivalent to a revenue of $325,300. In contrast, the 0.50 threshold yields significantly lower revenue, totaling $160,500.

Further, this table illustrates the trade-off between the total count of credits and the total revenue. To boost revenue, the department would need to increase the number of credits extended. However, it’s essential to exercise caution, as excessive credit allocation will lead to financial losses. Ultimately, the optimal threshold for the total count of credits should be tailored to the specific mission of HCD and cost structure (or financial motives) of the entity.

optimalthreshold <- whichThreshold_revenue %>%
  mutate(Threshold = Threshold * 100) %>%
  dplyr::select(Threshold, Total_Revenue, Total_Count_Of_Credits)

filtered_data <- optimalthreshold %>%
  filter(Threshold == '21' | Threshold == '50')

kable(filtered_data,
       caption = "Optimal Threshold") %>% kable_styling()
Optimal Threshold
Threshold Total_Revenue Total_Count_Of_Credits
21 325300 71.5
50 160500 92.5


CONCLUSION


While the model’s predictive abilities regarding homeowner acceptance of housing subsidies may not be fully proficient, it should still be regarded as a valuable starting point for future exploration and refinement. The model’s commendable accuracy, demonstrated generalizability, and its ability to reliably identify individuals unlikely to accept the housing subsidy contribute to the establishment of a strong foundational framework for predictive analysis in the domain of housing subsidies. Moreover, through further data gathering and refinement of the model’s feature, a deeper and more comprehensive understanding of the intricate factors influencing homeowner decisions regarding the subsidy can be achieved so HED can run a more efficient marketing campaign.

Upon further reflection and considering our initial observations, we recommend the Department of Housing and Community Development enhance marketing efforts involving: reaching out to homeowners via mobile phones, connecting with individuals in administrative roles, targeting married individuals, and intensifying marketing campaigns during periods of economic downturn and high unemployment rates.

In summary, the analysis of Emil City’s Housing and Community Development Department’s home repair tax credit program has provided valuable insights into its effectiveness and factors influencing homeowner participation with a record of boosting property values and community benefits. Key findings include the influence of prior contact, economic stability, and categorical variables on subsidy acceptance. A balanced approach considering both financial and societal benefits is also crucial. The department should refine its outreach strategy focusing on likely subsidy recipients while maintaining a balance between cost savings and community welfare, necessitating ongoing program parameter monitoring and adjustment for long-term success.