setwd("C:\\Users\\M.Mando\\Desktop\\M3-data analytics")
data <- read.csv("bank-full.csv",header = TRUE, sep = ";")
attach(data)
head(data)

Deliverable 1 - One paragraph on the motivation for the analysis.

The Emil City HCD seeks to improve the uptake of the home repair tax credit program. Historically, a broad and unspecific outreach has led to a mere 11% uptake among eligible homeowners. By leveraging data from previous campaigns, we aim to use predictive modeling to target homeowners more effectively, optimizing resource allocation and increasing the overall benefit to the community

library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(caret)
## Loading required package: lattice
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
library(caTools)
library(ROCR)

Deliverable 2 - Develop and interpret data visualizations that describe feature importance/correlation.

Correlation Matrix Visualization

# Correlation Matrix Visualization
corr_matrix <- cor(data[, sapply(data, is.numeric)])
corrplot(corr_matrix, method = "circle")

The correlations between these variables are generally weak, meaning that there is no strong linear relationship between them.

# Define a color palette
palette2 <- scales::hue_pal()(2)
data %>%
  select(y, age, balance) %>%
  gather(Variable, value, -y) %>%
  ggplot(aes(y, value, fill = y)) + 
    geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
    facet_wrap(~Variable, scales = "free") +
    scale_fill_manual(values = palette2) +
    labs(x = "Take the credit", y = "Mean", 
         title = "Feature associations with the decision to take the credit (Yes/No)",
         subtitle = "(Continuous outcomes)") +
    theme(legend.position = "none")

data %>%
  select(y, job, marital, education, default, housing, loan, contact, month, poutcome) %>%
  pivot_longer(cols = c(job, marital, education, default, housing, loan, contact, month, poutcome), 
               names_to = "Variable", values_to = "value") %>%
  count(Variable, value, y) %>%
  filter(value == "yes") %>%
  ggplot(aes(y, n, fill = y)) +   
    geom_bar(position = "dodge", stat = "identity") +
    facet_wrap(~Variable, scales = "free", ncol = 3) +
    scale_fill_manual(values = c("yes" = "blue", "no" = "red")) +
    labs(x = "Take the credit", y = "Count",
         title = "Feature associations with the decision to take the credit (Yes/No)",
         subtitle = "Categorical features (Yes and No)") +
    theme(legend.position = "none")

data %>%
  select(y, job, marital, education, default, contact, month, poutcome) %>%
  pivot_longer(cols = c(job, marital, education, contact, month, poutcome), 
               names_to = "Variable", values_to = "value") %>%
  count(Variable, value, y) %>%
  filter(!is.na(value)) %>%
  ggplot(aes(value, n, fill = y)) +   
    geom_bar(position = "dodge", stat = "identity") +
    facet_wrap(~Variable, scales = "free") +
    scale_fill_manual(values = c("yes" = "blue", "no" = "red")) + 
    labs(x = "Value", y = "Count",
         title = "Feature associations with the decision to take the credit (Yes/No)",
         subtitle = "Three-category features or more") +
    theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

These bar charts indicate that the dataset is imbalanced, with “no” responses being the predominant class. We can see how categorical variables are related to the decision to take the credit, even though “yes” responses are less frequent. Further analysis and feature engineering are required to assess these categorical features’ statistical significance and predictive power.

Deliverable 3: Split your data into a 65/35 training/test set

Kitchen Sink Data (df, train_df, test_df )

# Encode categorical variables using one-hot encoding
df <- data %>%
  mutate_if(is.character, as.factor) %>%
  select_if(is.factor) %>%
  mutate_all(funs(as.numeric(.)))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
set.seed(123)
split <- sample.split(df$y, SplitRatio = 0.65)
train_df <- subset(df, split == TRUE)
test_df <- subset(df, split == FALSE)

Deliverable 4: The Sensitivity (True Positive Rate) for a model with all the features is very low. Engineer new features that significantly increase the Sensitivity

# Binning age into age groups
df1 <- df %>%
  mutate(age_group = cut(age, breaks = c(0, 30, 50, Inf), labels = c("young", "middle-aged", "senior")))

# Perform feature scaling on numeric variables (e.g., z-score standardization)
numeric_vars <- c("age", "balance", "duration", "campaign", "pdays", "previous")
df1[numeric_vars] <- scale(data[numeric_vars])

# Create interaction terms (example: housing_loan_interaction)
df1 <- df1 %>%
  mutate(
    housing_loan_interaction = housing * loan,
    education_job_interaction = education * job,
    balance_duration_ratio = balance / duration,
    balance_to_age_ratio = balance / age
  )

Deliverable 4(a) Interpret your new features in one paragraph.

  • housing_loan_interaction represents an interaction between housing and loan, capturing the combined effects of both features, which may indicate a higher likelihood of homeowners taking the credit

  • education_job_interaction represents an interaction between education and job, this can be useful in exploring whether certain combinations of education and job types have a unique influence on the data.

  • balance_duration_ratio represents the ratio of the bank balance to the duration of the last contact. It can provide insights into how a customer’s financial stability relates to the length of the contact and potentially their decision-making.

  • balance_to_age_ratio represents the ratio of bank balance to age. Which might help in understanding how financial resources are distributed relative to a person’s age, which could be relevant for analyzing customer behavior.

Split your data into a 65/35 training/test set

Engineered Data (df1, train, test)

set.seed(123)
split <- sample.split(df1$y, SplitRatio = 0.65)
train <- subset(df1, split == TRUE)
test <- subset(df1, split == FALSE)

Deliverable 4(b) : Show a regression summary for both the kitchen sink and your engineered regression.

# Kitchen Sink Model: Using all available features
kitchen_sink_model <- lm(y ~ ., data = train_df)

# Engineered Regression Model: Using selected features including engineered ones
engineered_model <- lm(y ~ . , data = train)

# Display regression summaries
summary(kitchen_sink_model)
## 
## Call:
## lm(formula = y ~ ., data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3271 -0.1584 -0.1026 -0.0300  1.0684 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.3694913  0.0199794  68.545  < 2e-16 ***
## job          0.0009618  0.0005725   1.680   0.0930 .  
## marital      0.0148599  0.0030376   4.892 1.00e-06 ***
## education    0.0156917  0.0025144   6.241 4.42e-10 ***
## default     -0.0349005  0.0140066  -2.492   0.0127 *  
## housing     -0.0839684  0.0039090 -21.481  < 2e-16 ***
## loan        -0.0525500  0.0049803 -10.552  < 2e-16 ***
## contact     -0.0444409  0.0023362 -19.023  < 2e-16 ***
## month        0.0063292  0.0006755   9.370  < 2e-16 ***
## poutcome    -0.0184588  0.0019560  -9.437  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3136 on 29377 degrees of freedom
## Multiple R-squared:  0.04804,    Adjusted R-squared:  0.04775 
## F-statistic: 164.7 on 9 and 29377 DF,  p-value: < 2.2e-16
summary(engineered_model)
## 
## Call:
## lm(formula = y ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.95442 -0.13795 -0.05951  0.02123  1.10562 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.392e+00  2.891e-02  48.149  < 2e-16 ***
## job                        2.977e-04  1.690e-03   0.176 0.860203    
## marital                    1.321e-02  3.065e-03   4.311 1.63e-05 ***
## education                  1.634e-02  4.411e-03   3.705 0.000212 ***
## default                   -1.481e-02  1.276e-02  -1.161 0.245769    
## housing                   -1.667e-01  1.134e-02 -14.701  < 2e-16 ***
## loan                      -1.636e-01  1.540e-02 -10.625  < 2e-16 ***
## contact                   -3.828e-02  2.132e-03 -17.957  < 2e-16 ***
## month                      4.897e-03  6.191e-04   7.910 2.67e-15 ***
## poutcome                   2.202e-02  3.374e-03   6.525 6.93e-11 ***
## age_groupmiddle-aged      -6.713e-02  5.971e-03 -11.242  < 2e-16 ***
## age_groupsenior           -8.305e-02  1.073e-02  -7.739 1.04e-14 ***
## age                        2.513e-02  3.396e-03   7.401 1.39e-13 ***
## balance                    7.659e-03  1.693e-03   4.523 6.11e-06 ***
## duration                   1.232e-01  1.662e-03  74.118  < 2e-16 ***
## campaign                  -9.126e-03  1.685e-03  -5.415 6.19e-08 ***
## pdays                      4.065e-02  3.245e-03  12.529  < 2e-16 ***
## previous                   1.476e-02  1.725e-03   8.557  < 2e-16 ***
## housing_loan_interaction   7.321e-02  9.197e-03   7.960 1.79e-15 ***
## education_job_interaction  2.105e-04  7.526e-04   0.280 0.779696    
## balance_duration_ratio     1.765e-05  1.952e-05   0.905 0.365688    
## balance_to_age_ratio       1.016e-05  7.015e-05   0.145 0.884840    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.285 on 29365 degrees of freedom
## Multiple R-squared:  0.2142, Adjusted R-squared:  0.2136 
## F-statistic: 381.2 on 21 and 29365 DF,  p-value: < 2.2e-16

Deliverable 4(c) : Cross-validate both models; compare and interpret two facetted plots of ROC, Sensitivity and Specificity.

# Create a control object for cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Kitchen Sink Model: Using all available features
kitchen_sink_model_cv <- train(y ~ ., data = train_df, method = "lm", trControl = ctrl)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
# Engineered Regression Model: Using selected features including engineered ones
engineered_model_cv <- train(y ~ ., data = train, method = "lm", trControl = ctrl)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
# Display cross-validation results
print(kitchen_sink_model_cv)
## Linear Regression 
## 
## 29387 samples
##     9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 23510, 23509, 23510, 23510, 23509 
## Resampling results:
## 
##   RMSE       Rsquared    MAE      
##   0.3136448  0.04779809  0.1984907
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(engineered_model_cv)
## Linear Regression 
## 
## 29387 samples
##    20 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 23509, 23510, 23509, 23510, 23510 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.2857738  0.2098292  0.1797369
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

RoC Plots Comparison

# Load the pROC package
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Create ROC curves and calculate Sensitivity and Specificity for both models
roc_kitchen_sink <- roc(train_df$y, predict(kitchen_sink_model, newdata = train_df))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
roc_engineered <- roc(train$y, predict(engineered_model, newdata = train))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
# Create facetted plots
par(mfrow=c(1,2))  # Arrange plots side by side

# Plot ROC curve for the "kitchen sink" model
plot(roc_kitchen_sink, main = "ROC Curve - Kitchen Sink Model", col = "blue")
legend("bottomright", legend = paste("AUC =", round(auc(roc_kitchen_sink), 2)), col = "blue")

# Plot ROC curve for the "engineered" model
plot(roc_engineered, main = "ROC Curve - Engineered Model", col = "red")
legend("bottomright", legend = paste("AUC =", round(auc(roc_engineered), 2)), col = "red")

# Reset the plotting layout
par(mfrow=c(1,1))

The”Engineered” model with an AUC of 0.87 demonstrates better predictive performance, while the “Kitchen Sink” model with an AUC of 0.69 performs less effectively in discriminating between the classes. This means that the interactions and ratios added to the model have a substantial role in this discrimination.

Deliverable 5: Output an ROC curve for your new model and interpret it.

TP rate vs FP rate performance

# Predict on test data
pred_all <- predict(kitchen_sink_model, newdata=test_df, type="response")
pred_engineered <- predict(engineered_model, newdata=test, type="response")

# Create prediction objects
predObj_all <- prediction(pred_all, test_df$y)
perfObj_all <- performance(predObj_all, "tpr", "fpr")

predObj_eng <- prediction(pred_engineered, test$y)
perfObj_eng <- performance(predObj_eng, "tpr", "fpr")

# Plot ROC
plot(perfObj_all, colorize=TRUE)
plot(perfObj_eng, add=TRUE, col="red")

RoC New Model - Test Data

When applying both models to the test data, we observe similar results to those obtained on the training datasets. This consistency indicates that the models generalize well to new, unseen data. Importantly, the “Engineered” model continues to demonstrate superior predictive performance, while the “Kitchen Sink” model consistently performs less effectively in distinguishing between the two classes. These findings suggest that the feature-engineered model is a robust and reliable choice for making predictions on both the training and test datasets, highlighting its overall effectiveness in capturing the underlying patterns in the data.

roc_engineered <- roc(test$y, predict(engineered_model, newdata = test))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
# Plot ROC curve for the "engineered" model
plot(roc_engineered, main = "ROC Curve - Engineered Model", col = "darkgreen")
legend("bottomright", legend = paste("AUC =", round(auc(roc_engineered), 2)), col = "darkgreen")

# Reset the plotting layout
par(mfrow=c(1,1))

Deliverable 6: Develop a cost benefit analysis.

Deliverable 6(a): Write out the cost/benefit equation for each confusion metric.

True Positive (TP): Predicted they would take the credit and they did.

  • Cost: Cost of marketing allocation (-2850) + cost of the credit (-5000)

  • Benefit: Value from the sale of the improved home (+10000) + aggregate premium of homes surrounding the repaired home (+56000).

Equation 1: Benefit(TP) = 10,000 + 56,000 − 2,850 − 5,000

True Negative (TN): Predicted they wouldn’t take the credit and they didn’t.

  • Cost: $0 (no costs incurred)

  • Benefit: $0 (no benefits either)

Equation 2: Benefit(TN) = 0

False Positive (FP): Predicted they would take the credit, but they didn’t.

  • Cost: Cost of marketing allocation (-2850)

  • Benefit: $0 (since they didn’t take the credit)

Equation 3: Benefit(FP) = −2,850

False Negative (FN): Predicted they wouldn’t take the credit, but they did.

  • Cost: $0 (no marketing allocation)

  • Benefit: $0 (since the model didn’t predict them)

Equation 4:Benefit(FN) = 0

Deliverable 6(b): Create the ‘Cost/Benefit Table’ as seen above.

# Define thresholds
thresholds <- seq(0, 1.2, 0.01)

# Initialize vectors to store the calculated values
total_costs <- numeric(length(thresholds))
benefits <- numeric(length(thresholds))
credits <- numeric(length(thresholds))

for (i in 1:length(thresholds)) {
  threshold <- thresholds[i]
  
  pred_labels <- ifelse(pred_engineered > threshold, "Yes", "No")
  
  TP <- sum(pred_labels == "Yes" & test$y == "2") #Yes=2 & No=1
  TN <- sum(pred_labels == "No" & test$y == "1")
  FP <- sum(pred_labels == "Yes" & test$y == "1")
  FN <- sum(pred_labels == "No" & test$y == "2")
  
  # Compute costs
  TP_cost <- TP * -7850
  TN_cost <- 0
  FP_cost <- FP * -2850
  FN_cost <- 0
  
  total_costs[i] <- TP_cost + TN_cost + FP_cost + FN_cost
  
  # Compute benefits
  benefit_TP <- (10000 + 56000 - 2850 - 5000) * TP
  benefit_TN <- 0 * TN
  benefit_FP <- -2850 * FP
  benefit_FN <- 0 * FN
  
  benefits[i] <- benefit_TP + benefit_TN + benefit_FP + benefit_FN
  
  # Compute credits
  credits[i] <- TP + FN
}

# Create a data frame to display the results
cost_benefit_table <- data.frame(
  Threshold = thresholds,
  Total_Costs = total_costs,
  Benefits = benefits,
  Credits = credits,
  Net_Benefit = benefits + total_costs # assuming net benefit is benefits - costs
)

print(cost_benefit_table)
##     Threshold Total_Costs Benefits Credits Net_Benefit
## 1        0.00   -54353400 67812600    1851    13459200
## 2        0.01   -54353400 67812600    1851    13459200
## 3        0.02   -54353400 67812600    1851    13459200
## 4        0.03   -54353400 67812600    1851    13459200
## 5        0.04   -54353400 67812600    1851    13459200
## 6        0.05   -54353400 67812600    1851    13459200
## 7        0.06   -54353400 67812600    1851    13459200
## 8        0.07   -54353400 67812600    1851    13459200
## 9        0.08   -54353400 67812600    1851    13459200
## 10       0.09   -54353400 67812600    1851    13459200
## 11       0.10   -54353400 67812600    1851    13459200
## 12       0.11   -54353400 67812600    1851    13459200
## 13       0.12   -54353400 67812600    1851    13459200
## 14       0.13   -54353400 67812600    1851    13459200
## 15       0.14   -54353400 67812600    1851    13459200
## 16       0.15   -54353400 67812600    1851    13459200
## 17       0.16   -54353400 67812600    1851    13459200
## 18       0.17   -54353400 67812600    1851    13459200
## 19       0.18   -54353400 67812600    1851    13459200
## 20       0.19   -54353400 67812600    1851    13459200
## 21       0.20   -54353400 67812600    1851    13459200
## 22       0.21   -54353400 67812600    1851    13459200
## 23       0.22   -54353400 67812600    1851    13459200
## 24       0.23   -54353400 67812600    1851    13459200
## 25       0.24   -54353400 67812600    1851    13459200
## 26       0.25   -54353400 67812600    1851    13459200
## 27       0.26   -54353400 67812600    1851    13459200
## 28       0.27   -54353400 67812600    1851    13459200
## 29       0.28   -54353400 67812600    1851    13459200
## 30       0.29   -54353400 67812600    1851    13459200
## 31       0.30   -54353400 67812600    1851    13459200
## 32       0.31   -54353400 67812600    1851    13459200
## 33       0.32   -54353400 67812600    1851    13459200
## 34       0.33   -54353400 67812600    1851    13459200
## 35       0.34   -54353400 67812600    1851    13459200
## 36       0.35   -54353400 67812600    1851    13459200
## 37       0.36   -54353400 67812600    1851    13459200
## 38       0.37   -54353400 67812600    1851    13459200
## 39       0.38   -54353400 67812600    1851    13459200
## 40       0.39   -54353400 67812600    1851    13459200
## 41       0.40   -54353400 67812600    1851    13459200
## 42       0.41   -54353400 67812600    1851    13459200
## 43       0.42   -54353400 67812600    1851    13459200
## 44       0.43   -54353400 67812600    1851    13459200
## 45       0.44   -54353400 67812600    1851    13459200
## 46       0.45   -54353400 67812600    1851    13459200
## 47       0.46   -54353400 67812600    1851    13459200
## 48       0.47   -54353400 67812600    1851    13459200
## 49       0.48   -54353400 67812600    1851    13459200
## 50       0.49   -54353400 67812600    1851    13459200
## 51       0.50   -54353400 67812600    1851    13459200
## 52       0.51   -54353400 67812600    1851    13459200
## 53       0.52   -54353400 67812600    1851    13459200
## 54       0.53   -54353400 67812600    1851    13459200
## 55       0.54   -54353400 67812600    1851    13459200
## 56       0.55   -54353400 67812600    1851    13459200
## 57       0.56   -54353400 67812600    1851    13459200
## 58       0.57   -54353400 67812600    1851    13459200
## 59       0.58   -54353400 67812600    1851    13459200
## 60       0.59   -54353400 67812600    1851    13459200
## 61       0.60   -54353400 67812600    1851    13459200
## 62       0.61   -54353400 67812600    1851    13459200
## 63       0.62   -54353400 67812600    1851    13459200
## 64       0.63   -54353400 67812600    1851    13459200
## 65       0.64   -54353400 67812600    1851    13459200
## 66       0.65   -54353400 67812600    1851    13459200
## 67       0.66   -54353400 67812600    1851    13459200
## 68       0.67   -54353400 67812600    1851    13459200
## 69       0.68   -54353400 67812600    1851    13459200
## 70       0.69   -54353400 67812600    1851    13459200
## 71       0.70   -54353400 67812600    1851    13459200
## 72       0.71   -54353400 67812600    1851    13459200
## 73       0.72   -54353400 67812600    1851    13459200
## 74       0.73   -54353400 67812600    1851    13459200
## 75       0.74   -54353400 67812600    1851    13459200
## 76       0.75   -54350550 67815450    1851    13464900
## 77       0.76   -54350550 67815450    1851    13464900
## 78       0.77   -54350550 67815450    1851    13464900
## 79       0.78   -54350550 67815450    1851    13464900
## 80       0.79   -54350550 67815450    1851    13464900
## 81       0.80   -54347700 67818300    1851    13470600
## 82       0.81   -54342000 67824000    1851    13482000
## 83       0.82   -54330600 67835400    1851    13504800
## 84       0.83   -54313500 67852500    1851    13539000
## 85       0.84   -54304950 67861050    1851    13556100
## 86       0.85   -54262200 67903800    1851    13641600
## 87       0.86   -54213750 67952250    1851    13738500
## 88       0.87   -54162450 68003550    1851    13841100
## 89       0.88   -54096900 68069100    1851    13972200
## 90       0.89   -53917350 68248650    1851    14331300
## 91       0.90   -53649450 68516550    1851    14867100
## 92       0.91   -53395800 68770200    1851    15374400
## 93       0.92   -53028150 69137850    1851    16109700
## 94       0.93   -52529400 69636600    1851    17107200
## 95       0.94   -51956550 70209450    1851    18252900
## 96       0.95   -51233350 70866650    1851    19633300
## 97       0.96   -50363350 71538650    1851    21175300
## 98       0.97   -49412850 72357150    1851    22944300
## 99       0.98   -48246450 73325550    1851    25079100
## 100      0.99   -47074350 74299650    1851    27225300
## 101      1.00   -45688500 75487500    1851    29799000
## 102      1.01   -44314000 76400000    1851    32086000
## 103      1.02   -42832650 77551350    1851    34718700
## 104      1.03   -41287100 78370900    1851    37083800
## 105      1.04   -39757300 79570700    1851    39813400
## 106      1.05   -38015100 80586900    1851    42571800
## 107      1.06   -36535100 81340900    1851    44805800
## 108      1.07   -35054450 82425550    1851    47371100
## 109      1.08   -33446200 83307800    1851    49861600
## 110      1.09   -31823000 84271000    1851    52448000
## 111      1.10   -30139100 84568900    1851    54429800
## 112      1.11   -28349750 84972250    1851    56622500
## 113      1.12   -26811200 85124800    1851    58313600
## 114      1.13   -25341700 84878300    1851    59536600
## 115      1.14   -23979750 84326250    1851    60346500
## 116      1.15   -22564950 83167050    1851    60602100
## 117      1.16   -21210650 81617350    1851    60406700
## 118      1.17   -19969700 80284300    1851    60314600
## 119      1.18   -18730000 78026000    1851    59296000
## 120      1.19   -17455250 75076750    1851    57621500
## 121      1.20   -16584800 73373200    1851    56788400

Deliverable 6(c): Plot the confusion metric outcomes for each Threshold.

# Print confusion matrix for each Threashold

for (i in 1:length(thresholds)) {
  threshold <- thresholds[i]

  pred_labels <- ifelse(pred_engineered > threshold, "Yes", "No")
  
  TP <- sum(pred_labels == "Yes" & test$y == "2") #Yes=2 & No=1
  TN <- sum(pred_labels == "No" & test$y == "1")
  FP <- sum(pred_labels == "Yes" & test$y == "1")
  FN <- sum(pred_labels == "No" & test$y == "2")

  cat("Confusion Matrix for threshold", threshold, ":\n")
  cat("       Actual Yes   Actual No\n")
  cat("Pred Yes ", TP, "         ", FP, "\n")
  cat("Pred No  ", FN, "         ", TN, "\n")
  
  # Compute credits
  credits[i] <- TP + FN

}
## Confusion Matrix for threshold 0 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.01 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.02 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.03 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.04 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.05 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.06 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.07 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.08 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.09 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.1 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.11 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.12 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.13 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.14 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.15 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.16 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.17 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.18 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.19 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.2 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.21 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.22 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.23 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.24 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.25 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.26 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.27 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.28 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.29 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.3 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.31 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.32 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.33 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.34 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.35 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.36 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.37 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.38 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.39 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.4 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.41 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.42 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.43 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.44 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.45 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.46 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.47 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.48 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.49 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.5 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.51 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.52 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.53 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.54 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.55 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.56 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.57 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.58 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.59 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.6 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.61 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.62 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.63 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.64 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.65 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.66 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.67 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.68 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.69 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.7 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.71 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.72 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.73 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.74 :
##        Actual Yes   Actual No
## Pred Yes  1851           13973 
## Pred No   0           0 
## Confusion Matrix for threshold 0.75 :
##        Actual Yes   Actual No
## Pred Yes  1851           13972 
## Pred No   0           1 
## Confusion Matrix for threshold 0.76 :
##        Actual Yes   Actual No
## Pred Yes  1851           13972 
## Pred No   0           1 
## Confusion Matrix for threshold 0.77 :
##        Actual Yes   Actual No
## Pred Yes  1851           13972 
## Pred No   0           1 
## Confusion Matrix for threshold 0.78 :
##        Actual Yes   Actual No
## Pred Yes  1851           13972 
## Pred No   0           1 
## Confusion Matrix for threshold 0.79 :
##        Actual Yes   Actual No
## Pred Yes  1851           13972 
## Pred No   0           1 
## Confusion Matrix for threshold 0.8 :
##        Actual Yes   Actual No
## Pred Yes  1851           13971 
## Pred No   0           2 
## Confusion Matrix for threshold 0.81 :
##        Actual Yes   Actual No
## Pred Yes  1851           13969 
## Pred No   0           4 
## Confusion Matrix for threshold 0.82 :
##        Actual Yes   Actual No
## Pred Yes  1851           13965 
## Pred No   0           8 
## Confusion Matrix for threshold 0.83 :
##        Actual Yes   Actual No
## Pred Yes  1851           13959 
## Pred No   0           14 
## Confusion Matrix for threshold 0.84 :
##        Actual Yes   Actual No
## Pred Yes  1851           13956 
## Pred No   0           17 
## Confusion Matrix for threshold 0.85 :
##        Actual Yes   Actual No
## Pred Yes  1851           13941 
## Pred No   0           32 
## Confusion Matrix for threshold 0.86 :
##        Actual Yes   Actual No
## Pred Yes  1851           13924 
## Pred No   0           49 
## Confusion Matrix for threshold 0.87 :
##        Actual Yes   Actual No
## Pred Yes  1851           13906 
## Pred No   0           67 
## Confusion Matrix for threshold 0.88 :
##        Actual Yes   Actual No
## Pred Yes  1851           13883 
## Pred No   0           90 
## Confusion Matrix for threshold 0.89 :
##        Actual Yes   Actual No
## Pred Yes  1851           13820 
## Pred No   0           153 
## Confusion Matrix for threshold 0.9 :
##        Actual Yes   Actual No
## Pred Yes  1851           13726 
## Pred No   0           247 
## Confusion Matrix for threshold 0.91 :
##        Actual Yes   Actual No
## Pred Yes  1851           13637 
## Pred No   0           336 
## Confusion Matrix for threshold 0.92 :
##        Actual Yes   Actual No
## Pred Yes  1851           13508 
## Pred No   0           465 
## Confusion Matrix for threshold 0.93 :
##        Actual Yes   Actual No
## Pred Yes  1851           13333 
## Pred No   0           640 
## Confusion Matrix for threshold 0.94 :
##        Actual Yes   Actual No
## Pred Yes  1851           13132 
## Pred No   0           841 
## Confusion Matrix for threshold 0.95 :
##        Actual Yes   Actual No
## Pred Yes  1850           12881 
## Pred No   1           1092 
## Confusion Matrix for threshold 0.96 :
##        Actual Yes   Actual No
## Pred Yes  1847           12584 
## Pred No   4           1389 
## Confusion Matrix for threshold 0.97 :
##        Actual Yes   Actual No
## Pred Yes  1845           12256 
## Pred No   6           1717 
## Confusion Matrix for threshold 0.98 :
##        Actual Yes   Actual No
## Pred Yes  1842           11855 
## Pred No   9           2118 
## Confusion Matrix for threshold 0.99 :
##        Actual Yes   Actual No
## Pred Yes  1839           11452 
## Pred No   12           2521 
## Confusion Matrix for threshold 1 :
##        Actual Yes   Actual No
## Pred Yes  1836           10974 
## Pred No   15           2999 
## Confusion Matrix for threshold 1.01 :
##        Actual Yes   Actual No
## Pred Yes  1829           10511 
## Pred No   22           3462 
## Confusion Matrix for threshold 1.02 :
##        Actual Yes   Actual No
## Pred Yes  1824           10005 
## Pred No   27           3968 
## Confusion Matrix for threshold 1.03 :
##        Actual Yes   Actual No
## Pred Yes  1813           9493 
## Pred No   38           4480 
## Confusion Matrix for threshold 1.04 :
##        Actual Yes   Actual No
## Pred Yes  1808           8970 
## Pred No   43           5003 
## Confusion Matrix for threshold 1.05 :
##        Actual Yes   Actual No
## Pred Yes  1797           8389 
## Pred No   54           5584 
## Confusion Matrix for threshold 1.06 :
##        Actual Yes   Actual No
## Pred Yes  1786           7900 
## Pred No   65           6073 
## Confusion Matrix for threshold 1.07 :
##        Actual Yes   Actual No
## Pred Yes  1780           7397 
## Pred No   71           6576 
## Confusion Matrix for threshold 1.08 :
##        Actual Yes   Actual No
## Pred Yes  1769           6863 
## Pred No   82           7110 
## Confusion Matrix for threshold 1.09 :
##        Actual Yes   Actual No
## Pred Yes  1759           6321 
## Pred No   92           7652 
## Confusion Matrix for threshold 1.1 :
##        Actual Yes   Actual No
## Pred Yes  1738           5788 
## Pred No   113           8185 
## Confusion Matrix for threshold 1.11 :
##        Actual Yes   Actual No
## Pred Yes  1717           5218 
## Pred No   134           8755 
## Confusion Matrix for threshold 1.12 :
##        Actual Yes   Actual No
## Pred Yes  1696           4736 
## Pred No   155           9237 
## Confusion Matrix for threshold 1.13 :
##        Actual Yes   Actual No
## Pred Yes  1670           4292 
## Pred No   181           9681 
## Confusion Matrix for threshold 1.14 :
##        Actual Yes   Actual No
## Pred Yes  1641           3894 
## Pred No   210           10079 
## Confusion Matrix for threshold 1.15 :
##        Actual Yes   Actual No
## Pred Yes  1602           3505 
## Pred No   249           10468 
## Confusion Matrix for threshold 1.16 :
##        Actual Yes   Actual No
## Pred Yes  1558           3151 
## Pred No   293           10822 
## Confusion Matrix for threshold 1.17 :
##        Actual Yes   Actual No
## Pred Yes  1519           2823 
## Pred No   332           11150 
## Confusion Matrix for threshold 1.18 :
##        Actual Yes   Actual No
## Pred Yes  1466           2534 
## Pred No   385           11439 
## Confusion Matrix for threshold 1.19 :
##        Actual Yes   Actual No
## Pred Yes  1402           2263 
## Pred No   449           11710 
## Confusion Matrix for threshold 1.2 :
##        Actual Yes   Actual No
## Pred Yes  1363           2065 
## Pred No   488           11908

Deliverable 6(d): Create two small multiple plots that show Threshold as a function of Total_Revenue and Total_Count_of_Credits. Interpret this.

# Create data frame for plotting
plot_data <- data.frame(Threshold = thresholds, Total_Revenue = benefits, Total_Count_of_Credits = credits)

# Plot Threshold vs. Total Revenue
ggplot(plot_data, aes(x = Threshold, y = Total_Revenue)) + geom_line() + ggtitle("Threshold vs. Total Revenue")

# Plot Threshold vs. Total Count of Credits
ggplot(plot_data, aes(x = Threshold, y = Total_Count_of_Credits)) + geom_line() + ggtitle("Threshold vs. Total Count of Credits")

  1. Threshold vs. Total Revenue: The graph reveals a distinct pattern where the Total Revenue increases as the Threshold rises, reaching its peak value at a Threshold of 1.12. Beyond this point, as the Threshold continues to increase, there is a subsequent decrease in Total Revenue. As the objective is maximising the revenue, our analysis suggests setting the threshold at 1.12 is most beneficial.

  2. Threshold vs. Total Count of Credits: The total Count of Credits remains constant at a value of 1851, irrespective of changes in the Threshold. This indicates that the Total Count of Credits is not influenced by variations in the Threshold value. But as we see from the confusion matrix above, redistributing the credit across TP, TN, FP, and FN can have a significant impact on the revenue, which can be crucial in our case.

Deliverable 6(e): Create a table of the Total_Revenue and Total_Count_of_Credits allocated for 2 categories. 50%_Threshold and your Optimal_Threshold.

optimal_threshold <- thresholds[which.max(benefits)]

results <- data.frame(
  Category = c("50% Threshold", "Optimal Threshold"),
  Threshold = c(0.5, optimal_threshold),
  Total_Revenue = c(benefits[which(thresholds == 0.5)], max(benefits)),
  Total_Count_of_Credits = c(credits[which(thresholds == 0.5)], credits[which.max(benefits)])
)

print(results)
##            Category Threshold Total_Revenue Total_Count_of_Credits
## 1     50% Threshold      0.50      67812600                   1851
## 2 Optimal Threshold      1.12      85124800                   1851

Deliverable 7: Conclude whether and why this model should or shouldn’t be put into production. What could make the model better? What would you do to ensure that the marketing materials resulted in a better response rate?

The engineered model shows significant promise in properly redistributing the home repair tax credit program in Emil City. The potential for increased revenue suggests a positive impact. However, before fully deploying the model in production, it is crucial to validate its results, understand the implications of the chosen threshold, and ensure that outreach strategies align with the model’s predictions. The ultimate goal is to balance increased revenue with community benefit, ensuring that homeowners genuinely in need are reached and can benefit from the program, as well as setting the right policies and governance to ensure that eligible applicants are those who deserve to benefit, as we might know there will always be some loopholes in any system, and to ensure that the marketing materials results in a better response rate, we might need to have regular validations and insurance mechanisms that the system is not abused.