setwd("C:\\Users\\M.Mando\\Desktop\\M3-data analytics")
data <- read.csv("bank-full.csv",header = TRUE, sep = ";")
attach(data)
head(data)
The Emil City HCD seeks to improve the uptake of the home repair tax credit program. Historically, a broad and unspecific outreach has led to a mere 11% uptake among eligible homeowners. By leveraging data from previous campaigns, we aim to use predictive modeling to target homeowners more effectively, optimizing resource allocation and increasing the overall benefit to the community
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(caret)
## Loading required package: lattice
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
library(caTools)
library(ROCR)
# Correlation Matrix Visualization
corr_matrix <- cor(data[, sapply(data, is.numeric)])
corrplot(corr_matrix, method = "circle")
The correlations between these variables are generally weak, meaning that there is no strong linear relationship between them.
# Define a color palette
palette2 <- scales::hue_pal()(2)
data %>%
select(y, age, balance) %>%
gather(Variable, value, -y) %>%
ggplot(aes(y, value, fill = y)) +
geom_bar(position = "dodge", stat = "summary", fun = "mean") +
facet_wrap(~Variable, scales = "free") +
scale_fill_manual(values = palette2) +
labs(x = "Take the credit", y = "Mean",
title = "Feature associations with the decision to take the credit (Yes/No)",
subtitle = "(Continuous outcomes)") +
theme(legend.position = "none")
data %>%
select(y, job, marital, education, default, housing, loan, contact, month, poutcome) %>%
pivot_longer(cols = c(job, marital, education, default, housing, loan, contact, month, poutcome),
names_to = "Variable", values_to = "value") %>%
count(Variable, value, y) %>%
filter(value == "yes") %>%
ggplot(aes(y, n, fill = y)) +
geom_bar(position = "dodge", stat = "identity") +
facet_wrap(~Variable, scales = "free", ncol = 3) +
scale_fill_manual(values = c("yes" = "blue", "no" = "red")) +
labs(x = "Take the credit", y = "Count",
title = "Feature associations with the decision to take the credit (Yes/No)",
subtitle = "Categorical features (Yes and No)") +
theme(legend.position = "none")
data %>%
select(y, job, marital, education, default, contact, month, poutcome) %>%
pivot_longer(cols = c(job, marital, education, contact, month, poutcome),
names_to = "Variable", values_to = "value") %>%
count(Variable, value, y) %>%
filter(!is.na(value)) %>%
ggplot(aes(value, n, fill = y)) +
geom_bar(position = "dodge", stat = "identity") +
facet_wrap(~Variable, scales = "free") +
scale_fill_manual(values = c("yes" = "blue", "no" = "red")) +
labs(x = "Value", y = "Count",
title = "Feature associations with the decision to take the credit (Yes/No)",
subtitle = "Three-category features or more") +
theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
These bar charts indicate that the dataset is imbalanced, with “no” responses being the predominant class. We can see how categorical variables are related to the decision to take the credit, even though “yes” responses are less frequent. Further analysis and feature engineering are required to assess these categorical features’ statistical significance and predictive power.
# Encode categorical variables using one-hot encoding
df <- data %>%
mutate_if(is.character, as.factor) %>%
select_if(is.factor) %>%
mutate_all(funs(as.numeric(.)))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
##
## # Simple named list: list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
##
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
set.seed(123)
split <- sample.split(df$y, SplitRatio = 0.65)
train_df <- subset(df, split == TRUE)
test_df <- subset(df, split == FALSE)
# Binning age into age groups
df1 <- df %>%
mutate(age_group = cut(age, breaks = c(0, 30, 50, Inf), labels = c("young", "middle-aged", "senior")))
# Perform feature scaling on numeric variables (e.g., z-score standardization)
numeric_vars <- c("age", "balance", "duration", "campaign", "pdays", "previous")
df1[numeric_vars] <- scale(data[numeric_vars])
# Create interaction terms (example: housing_loan_interaction)
df1 <- df1 %>%
mutate(
housing_loan_interaction = housing * loan,
education_job_interaction = education * job,
balance_duration_ratio = balance / duration,
balance_to_age_ratio = balance / age
)
housing_loan_interaction represents an interaction between housing and loan, capturing the combined effects of both features, which may indicate a higher likelihood of homeowners taking the credit
education_job_interaction represents an interaction between education and job, this can be useful in exploring whether certain combinations of education and job types have a unique influence on the data.
balance_duration_ratio represents the ratio of the bank balance to the duration of the last contact. It can provide insights into how a customer’s financial stability relates to the length of the contact and potentially their decision-making.
balance_to_age_ratio represents the ratio of bank balance to age. Which might help in understanding how financial resources are distributed relative to a person’s age, which could be relevant for analyzing customer behavior.
set.seed(123)
split <- sample.split(df1$y, SplitRatio = 0.65)
train <- subset(df1, split == TRUE)
test <- subset(df1, split == FALSE)
# Kitchen Sink Model: Using all available features
kitchen_sink_model <- lm(y ~ ., data = train_df)
# Engineered Regression Model: Using selected features including engineered ones
engineered_model <- lm(y ~ . , data = train)
# Display regression summaries
summary(kitchen_sink_model)
##
## Call:
## lm(formula = y ~ ., data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3271 -0.1584 -0.1026 -0.0300 1.0684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3694913 0.0199794 68.545 < 2e-16 ***
## job 0.0009618 0.0005725 1.680 0.0930 .
## marital 0.0148599 0.0030376 4.892 1.00e-06 ***
## education 0.0156917 0.0025144 6.241 4.42e-10 ***
## default -0.0349005 0.0140066 -2.492 0.0127 *
## housing -0.0839684 0.0039090 -21.481 < 2e-16 ***
## loan -0.0525500 0.0049803 -10.552 < 2e-16 ***
## contact -0.0444409 0.0023362 -19.023 < 2e-16 ***
## month 0.0063292 0.0006755 9.370 < 2e-16 ***
## poutcome -0.0184588 0.0019560 -9.437 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3136 on 29377 degrees of freedom
## Multiple R-squared: 0.04804, Adjusted R-squared: 0.04775
## F-statistic: 164.7 on 9 and 29377 DF, p-value: < 2.2e-16
summary(engineered_model)
##
## Call:
## lm(formula = y ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.95442 -0.13795 -0.05951 0.02123 1.10562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.392e+00 2.891e-02 48.149 < 2e-16 ***
## job 2.977e-04 1.690e-03 0.176 0.860203
## marital 1.321e-02 3.065e-03 4.311 1.63e-05 ***
## education 1.634e-02 4.411e-03 3.705 0.000212 ***
## default -1.481e-02 1.276e-02 -1.161 0.245769
## housing -1.667e-01 1.134e-02 -14.701 < 2e-16 ***
## loan -1.636e-01 1.540e-02 -10.625 < 2e-16 ***
## contact -3.828e-02 2.132e-03 -17.957 < 2e-16 ***
## month 4.897e-03 6.191e-04 7.910 2.67e-15 ***
## poutcome 2.202e-02 3.374e-03 6.525 6.93e-11 ***
## age_groupmiddle-aged -6.713e-02 5.971e-03 -11.242 < 2e-16 ***
## age_groupsenior -8.305e-02 1.073e-02 -7.739 1.04e-14 ***
## age 2.513e-02 3.396e-03 7.401 1.39e-13 ***
## balance 7.659e-03 1.693e-03 4.523 6.11e-06 ***
## duration 1.232e-01 1.662e-03 74.118 < 2e-16 ***
## campaign -9.126e-03 1.685e-03 -5.415 6.19e-08 ***
## pdays 4.065e-02 3.245e-03 12.529 < 2e-16 ***
## previous 1.476e-02 1.725e-03 8.557 < 2e-16 ***
## housing_loan_interaction 7.321e-02 9.197e-03 7.960 1.79e-15 ***
## education_job_interaction 2.105e-04 7.526e-04 0.280 0.779696
## balance_duration_ratio 1.765e-05 1.952e-05 0.905 0.365688
## balance_to_age_ratio 1.016e-05 7.015e-05 0.145 0.884840
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.285 on 29365 degrees of freedom
## Multiple R-squared: 0.2142, Adjusted R-squared: 0.2136
## F-statistic: 381.2 on 21 and 29365 DF, p-value: < 2.2e-16
# Create a control object for cross-validation
ctrl <- trainControl(method = "cv", number = 5)
# Kitchen Sink Model: Using all available features
kitchen_sink_model_cv <- train(y ~ ., data = train_df, method = "lm", trControl = ctrl)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
# Engineered Regression Model: Using selected features including engineered ones
engineered_model_cv <- train(y ~ ., data = train, method = "lm", trControl = ctrl)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
# Display cross-validation results
print(kitchen_sink_model_cv)
## Linear Regression
##
## 29387 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 23510, 23509, 23510, 23510, 23509
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3136448 0.04779809 0.1984907
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(engineered_model_cv)
## Linear Regression
##
## 29387 samples
## 20 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 23509, 23510, 23509, 23510, 23510
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2857738 0.2098292 0.1797369
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Load the pROC package
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Create ROC curves and calculate Sensitivity and Specificity for both models
roc_kitchen_sink <- roc(train_df$y, predict(kitchen_sink_model, newdata = train_df))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
roc_engineered <- roc(train$y, predict(engineered_model, newdata = train))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
# Create facetted plots
par(mfrow=c(1,2)) # Arrange plots side by side
# Plot ROC curve for the "kitchen sink" model
plot(roc_kitchen_sink, main = "ROC Curve - Kitchen Sink Model", col = "blue")
legend("bottomright", legend = paste("AUC =", round(auc(roc_kitchen_sink), 2)), col = "blue")
# Plot ROC curve for the "engineered" model
plot(roc_engineered, main = "ROC Curve - Engineered Model", col = "red")
legend("bottomright", legend = paste("AUC =", round(auc(roc_engineered), 2)), col = "red")
# Reset the plotting layout
par(mfrow=c(1,1))
The”Engineered” model with an AUC of 0.87 demonstrates better predictive performance, while the “Kitchen Sink” model with an AUC of 0.69 performs less effectively in discriminating between the classes. This means that the interactions and ratios added to the model have a substantial role in this discrimination.
# Predict on test data
pred_all <- predict(kitchen_sink_model, newdata=test_df, type="response")
pred_engineered <- predict(engineered_model, newdata=test, type="response")
# Create prediction objects
predObj_all <- prediction(pred_all, test_df$y)
perfObj_all <- performance(predObj_all, "tpr", "fpr")
predObj_eng <- prediction(pred_engineered, test$y)
perfObj_eng <- performance(predObj_eng, "tpr", "fpr")
# Plot ROC
plot(perfObj_all, colorize=TRUE)
plot(perfObj_eng, add=TRUE, col="red")
When applying both models to the test data, we observe similar results to those obtained on the training datasets. This consistency indicates that the models generalize well to new, unseen data. Importantly, the “Engineered” model continues to demonstrate superior predictive performance, while the “Kitchen Sink” model consistently performs less effectively in distinguishing between the two classes. These findings suggest that the feature-engineered model is a robust and reliable choice for making predictions on both the training and test datasets, highlighting its overall effectiveness in capturing the underlying patterns in the data.
roc_engineered <- roc(test$y, predict(engineered_model, newdata = test))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
# Plot ROC curve for the "engineered" model
plot(roc_engineered, main = "ROC Curve - Engineered Model", col = "darkgreen")
legend("bottomright", legend = paste("AUC =", round(auc(roc_engineered), 2)), col = "darkgreen")
# Reset the plotting layout
par(mfrow=c(1,1))
True Positive (TP): Predicted they would take the credit and they did.
Cost: Cost of marketing allocation
(-2850) + cost of the credit
(-5000)
Benefit: Value from the sale of the improved
home (+10000) + aggregate premium of homes
surrounding the repaired home
(+56000).
Equation 1: Benefit(TP) = 10,000 + 56,000 − 2,850 − 5,000
True Negative (TN): Predicted they wouldn’t take the credit and they didn’t.
Cost: $0 (no costs incurred)
Benefit: $0 (no benefits either)
Equation 2: Benefit(TN) = 0
False Positive (FP): Predicted they would take the credit, but they didn’t.
Cost: Cost of marketing allocation
(-2850)
Benefit: $0 (since they didn’t take the credit)
Equation 3: Benefit(FP) = −2,850
False Negative (FN): Predicted they wouldn’t take the credit, but they did.
Cost: $0 (no marketing allocation)
Benefit: $0 (since the model didn’t predict them)
Equation 4:Benefit(FN) = 0
# Define thresholds
thresholds <- seq(0, 1.2, 0.01)
# Initialize vectors to store the calculated values
total_costs <- numeric(length(thresholds))
benefits <- numeric(length(thresholds))
credits <- numeric(length(thresholds))
for (i in 1:length(thresholds)) {
threshold <- thresholds[i]
pred_labels <- ifelse(pred_engineered > threshold, "Yes", "No")
TP <- sum(pred_labels == "Yes" & test$y == "2") #Yes=2 & No=1
TN <- sum(pred_labels == "No" & test$y == "1")
FP <- sum(pred_labels == "Yes" & test$y == "1")
FN <- sum(pred_labels == "No" & test$y == "2")
# Compute costs
TP_cost <- TP * -7850
TN_cost <- 0
FP_cost <- FP * -2850
FN_cost <- 0
total_costs[i] <- TP_cost + TN_cost + FP_cost + FN_cost
# Compute benefits
benefit_TP <- (10000 + 56000 - 2850 - 5000) * TP
benefit_TN <- 0 * TN
benefit_FP <- -2850 * FP
benefit_FN <- 0 * FN
benefits[i] <- benefit_TP + benefit_TN + benefit_FP + benefit_FN
# Compute credits
credits[i] <- TP + FN
}
# Create a data frame to display the results
cost_benefit_table <- data.frame(
Threshold = thresholds,
Total_Costs = total_costs,
Benefits = benefits,
Credits = credits,
Net_Benefit = benefits + total_costs # assuming net benefit is benefits - costs
)
print(cost_benefit_table)
## Threshold Total_Costs Benefits Credits Net_Benefit
## 1 0.00 -54353400 67812600 1851 13459200
## 2 0.01 -54353400 67812600 1851 13459200
## 3 0.02 -54353400 67812600 1851 13459200
## 4 0.03 -54353400 67812600 1851 13459200
## 5 0.04 -54353400 67812600 1851 13459200
## 6 0.05 -54353400 67812600 1851 13459200
## 7 0.06 -54353400 67812600 1851 13459200
## 8 0.07 -54353400 67812600 1851 13459200
## 9 0.08 -54353400 67812600 1851 13459200
## 10 0.09 -54353400 67812600 1851 13459200
## 11 0.10 -54353400 67812600 1851 13459200
## 12 0.11 -54353400 67812600 1851 13459200
## 13 0.12 -54353400 67812600 1851 13459200
## 14 0.13 -54353400 67812600 1851 13459200
## 15 0.14 -54353400 67812600 1851 13459200
## 16 0.15 -54353400 67812600 1851 13459200
## 17 0.16 -54353400 67812600 1851 13459200
## 18 0.17 -54353400 67812600 1851 13459200
## 19 0.18 -54353400 67812600 1851 13459200
## 20 0.19 -54353400 67812600 1851 13459200
## 21 0.20 -54353400 67812600 1851 13459200
## 22 0.21 -54353400 67812600 1851 13459200
## 23 0.22 -54353400 67812600 1851 13459200
## 24 0.23 -54353400 67812600 1851 13459200
## 25 0.24 -54353400 67812600 1851 13459200
## 26 0.25 -54353400 67812600 1851 13459200
## 27 0.26 -54353400 67812600 1851 13459200
## 28 0.27 -54353400 67812600 1851 13459200
## 29 0.28 -54353400 67812600 1851 13459200
## 30 0.29 -54353400 67812600 1851 13459200
## 31 0.30 -54353400 67812600 1851 13459200
## 32 0.31 -54353400 67812600 1851 13459200
## 33 0.32 -54353400 67812600 1851 13459200
## 34 0.33 -54353400 67812600 1851 13459200
## 35 0.34 -54353400 67812600 1851 13459200
## 36 0.35 -54353400 67812600 1851 13459200
## 37 0.36 -54353400 67812600 1851 13459200
## 38 0.37 -54353400 67812600 1851 13459200
## 39 0.38 -54353400 67812600 1851 13459200
## 40 0.39 -54353400 67812600 1851 13459200
## 41 0.40 -54353400 67812600 1851 13459200
## 42 0.41 -54353400 67812600 1851 13459200
## 43 0.42 -54353400 67812600 1851 13459200
## 44 0.43 -54353400 67812600 1851 13459200
## 45 0.44 -54353400 67812600 1851 13459200
## 46 0.45 -54353400 67812600 1851 13459200
## 47 0.46 -54353400 67812600 1851 13459200
## 48 0.47 -54353400 67812600 1851 13459200
## 49 0.48 -54353400 67812600 1851 13459200
## 50 0.49 -54353400 67812600 1851 13459200
## 51 0.50 -54353400 67812600 1851 13459200
## 52 0.51 -54353400 67812600 1851 13459200
## 53 0.52 -54353400 67812600 1851 13459200
## 54 0.53 -54353400 67812600 1851 13459200
## 55 0.54 -54353400 67812600 1851 13459200
## 56 0.55 -54353400 67812600 1851 13459200
## 57 0.56 -54353400 67812600 1851 13459200
## 58 0.57 -54353400 67812600 1851 13459200
## 59 0.58 -54353400 67812600 1851 13459200
## 60 0.59 -54353400 67812600 1851 13459200
## 61 0.60 -54353400 67812600 1851 13459200
## 62 0.61 -54353400 67812600 1851 13459200
## 63 0.62 -54353400 67812600 1851 13459200
## 64 0.63 -54353400 67812600 1851 13459200
## 65 0.64 -54353400 67812600 1851 13459200
## 66 0.65 -54353400 67812600 1851 13459200
## 67 0.66 -54353400 67812600 1851 13459200
## 68 0.67 -54353400 67812600 1851 13459200
## 69 0.68 -54353400 67812600 1851 13459200
## 70 0.69 -54353400 67812600 1851 13459200
## 71 0.70 -54353400 67812600 1851 13459200
## 72 0.71 -54353400 67812600 1851 13459200
## 73 0.72 -54353400 67812600 1851 13459200
## 74 0.73 -54353400 67812600 1851 13459200
## 75 0.74 -54353400 67812600 1851 13459200
## 76 0.75 -54350550 67815450 1851 13464900
## 77 0.76 -54350550 67815450 1851 13464900
## 78 0.77 -54350550 67815450 1851 13464900
## 79 0.78 -54350550 67815450 1851 13464900
## 80 0.79 -54350550 67815450 1851 13464900
## 81 0.80 -54347700 67818300 1851 13470600
## 82 0.81 -54342000 67824000 1851 13482000
## 83 0.82 -54330600 67835400 1851 13504800
## 84 0.83 -54313500 67852500 1851 13539000
## 85 0.84 -54304950 67861050 1851 13556100
## 86 0.85 -54262200 67903800 1851 13641600
## 87 0.86 -54213750 67952250 1851 13738500
## 88 0.87 -54162450 68003550 1851 13841100
## 89 0.88 -54096900 68069100 1851 13972200
## 90 0.89 -53917350 68248650 1851 14331300
## 91 0.90 -53649450 68516550 1851 14867100
## 92 0.91 -53395800 68770200 1851 15374400
## 93 0.92 -53028150 69137850 1851 16109700
## 94 0.93 -52529400 69636600 1851 17107200
## 95 0.94 -51956550 70209450 1851 18252900
## 96 0.95 -51233350 70866650 1851 19633300
## 97 0.96 -50363350 71538650 1851 21175300
## 98 0.97 -49412850 72357150 1851 22944300
## 99 0.98 -48246450 73325550 1851 25079100
## 100 0.99 -47074350 74299650 1851 27225300
## 101 1.00 -45688500 75487500 1851 29799000
## 102 1.01 -44314000 76400000 1851 32086000
## 103 1.02 -42832650 77551350 1851 34718700
## 104 1.03 -41287100 78370900 1851 37083800
## 105 1.04 -39757300 79570700 1851 39813400
## 106 1.05 -38015100 80586900 1851 42571800
## 107 1.06 -36535100 81340900 1851 44805800
## 108 1.07 -35054450 82425550 1851 47371100
## 109 1.08 -33446200 83307800 1851 49861600
## 110 1.09 -31823000 84271000 1851 52448000
## 111 1.10 -30139100 84568900 1851 54429800
## 112 1.11 -28349750 84972250 1851 56622500
## 113 1.12 -26811200 85124800 1851 58313600
## 114 1.13 -25341700 84878300 1851 59536600
## 115 1.14 -23979750 84326250 1851 60346500
## 116 1.15 -22564950 83167050 1851 60602100
## 117 1.16 -21210650 81617350 1851 60406700
## 118 1.17 -19969700 80284300 1851 60314600
## 119 1.18 -18730000 78026000 1851 59296000
## 120 1.19 -17455250 75076750 1851 57621500
## 121 1.20 -16584800 73373200 1851 56788400
# Print confusion matrix for each Threashold
for (i in 1:length(thresholds)) {
threshold <- thresholds[i]
pred_labels <- ifelse(pred_engineered > threshold, "Yes", "No")
TP <- sum(pred_labels == "Yes" & test$y == "2") #Yes=2 & No=1
TN <- sum(pred_labels == "No" & test$y == "1")
FP <- sum(pred_labels == "Yes" & test$y == "1")
FN <- sum(pred_labels == "No" & test$y == "2")
cat("Confusion Matrix for threshold", threshold, ":\n")
cat(" Actual Yes Actual No\n")
cat("Pred Yes ", TP, " ", FP, "\n")
cat("Pred No ", FN, " ", TN, "\n")
# Compute credits
credits[i] <- TP + FN
}
## Confusion Matrix for threshold 0 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.01 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.02 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.03 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.04 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.05 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.06 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.07 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.08 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.09 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.1 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.11 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.12 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.13 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.14 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.15 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.16 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.17 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.18 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.19 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.2 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.21 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.22 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.23 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.24 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.25 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.26 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.27 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.28 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.29 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.3 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.31 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.32 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.33 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.34 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.35 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.36 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.37 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.38 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.39 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.4 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.41 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.42 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.43 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.44 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.45 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.46 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.47 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.48 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.49 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.5 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.51 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.52 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.53 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.54 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.55 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.56 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.57 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.58 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.59 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.6 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.61 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.62 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.63 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.64 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.65 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.66 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.67 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.68 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.69 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.7 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.71 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.72 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.73 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.74 :
## Actual Yes Actual No
## Pred Yes 1851 13973
## Pred No 0 0
## Confusion Matrix for threshold 0.75 :
## Actual Yes Actual No
## Pred Yes 1851 13972
## Pred No 0 1
## Confusion Matrix for threshold 0.76 :
## Actual Yes Actual No
## Pred Yes 1851 13972
## Pred No 0 1
## Confusion Matrix for threshold 0.77 :
## Actual Yes Actual No
## Pred Yes 1851 13972
## Pred No 0 1
## Confusion Matrix for threshold 0.78 :
## Actual Yes Actual No
## Pred Yes 1851 13972
## Pred No 0 1
## Confusion Matrix for threshold 0.79 :
## Actual Yes Actual No
## Pred Yes 1851 13972
## Pred No 0 1
## Confusion Matrix for threshold 0.8 :
## Actual Yes Actual No
## Pred Yes 1851 13971
## Pred No 0 2
## Confusion Matrix for threshold 0.81 :
## Actual Yes Actual No
## Pred Yes 1851 13969
## Pred No 0 4
## Confusion Matrix for threshold 0.82 :
## Actual Yes Actual No
## Pred Yes 1851 13965
## Pred No 0 8
## Confusion Matrix for threshold 0.83 :
## Actual Yes Actual No
## Pred Yes 1851 13959
## Pred No 0 14
## Confusion Matrix for threshold 0.84 :
## Actual Yes Actual No
## Pred Yes 1851 13956
## Pred No 0 17
## Confusion Matrix for threshold 0.85 :
## Actual Yes Actual No
## Pred Yes 1851 13941
## Pred No 0 32
## Confusion Matrix for threshold 0.86 :
## Actual Yes Actual No
## Pred Yes 1851 13924
## Pred No 0 49
## Confusion Matrix for threshold 0.87 :
## Actual Yes Actual No
## Pred Yes 1851 13906
## Pred No 0 67
## Confusion Matrix for threshold 0.88 :
## Actual Yes Actual No
## Pred Yes 1851 13883
## Pred No 0 90
## Confusion Matrix for threshold 0.89 :
## Actual Yes Actual No
## Pred Yes 1851 13820
## Pred No 0 153
## Confusion Matrix for threshold 0.9 :
## Actual Yes Actual No
## Pred Yes 1851 13726
## Pred No 0 247
## Confusion Matrix for threshold 0.91 :
## Actual Yes Actual No
## Pred Yes 1851 13637
## Pred No 0 336
## Confusion Matrix for threshold 0.92 :
## Actual Yes Actual No
## Pred Yes 1851 13508
## Pred No 0 465
## Confusion Matrix for threshold 0.93 :
## Actual Yes Actual No
## Pred Yes 1851 13333
## Pred No 0 640
## Confusion Matrix for threshold 0.94 :
## Actual Yes Actual No
## Pred Yes 1851 13132
## Pred No 0 841
## Confusion Matrix for threshold 0.95 :
## Actual Yes Actual No
## Pred Yes 1850 12881
## Pred No 1 1092
## Confusion Matrix for threshold 0.96 :
## Actual Yes Actual No
## Pred Yes 1847 12584
## Pred No 4 1389
## Confusion Matrix for threshold 0.97 :
## Actual Yes Actual No
## Pred Yes 1845 12256
## Pred No 6 1717
## Confusion Matrix for threshold 0.98 :
## Actual Yes Actual No
## Pred Yes 1842 11855
## Pred No 9 2118
## Confusion Matrix for threshold 0.99 :
## Actual Yes Actual No
## Pred Yes 1839 11452
## Pred No 12 2521
## Confusion Matrix for threshold 1 :
## Actual Yes Actual No
## Pred Yes 1836 10974
## Pred No 15 2999
## Confusion Matrix for threshold 1.01 :
## Actual Yes Actual No
## Pred Yes 1829 10511
## Pred No 22 3462
## Confusion Matrix for threshold 1.02 :
## Actual Yes Actual No
## Pred Yes 1824 10005
## Pred No 27 3968
## Confusion Matrix for threshold 1.03 :
## Actual Yes Actual No
## Pred Yes 1813 9493
## Pred No 38 4480
## Confusion Matrix for threshold 1.04 :
## Actual Yes Actual No
## Pred Yes 1808 8970
## Pred No 43 5003
## Confusion Matrix for threshold 1.05 :
## Actual Yes Actual No
## Pred Yes 1797 8389
## Pred No 54 5584
## Confusion Matrix for threshold 1.06 :
## Actual Yes Actual No
## Pred Yes 1786 7900
## Pred No 65 6073
## Confusion Matrix for threshold 1.07 :
## Actual Yes Actual No
## Pred Yes 1780 7397
## Pred No 71 6576
## Confusion Matrix for threshold 1.08 :
## Actual Yes Actual No
## Pred Yes 1769 6863
## Pred No 82 7110
## Confusion Matrix for threshold 1.09 :
## Actual Yes Actual No
## Pred Yes 1759 6321
## Pred No 92 7652
## Confusion Matrix for threshold 1.1 :
## Actual Yes Actual No
## Pred Yes 1738 5788
## Pred No 113 8185
## Confusion Matrix for threshold 1.11 :
## Actual Yes Actual No
## Pred Yes 1717 5218
## Pred No 134 8755
## Confusion Matrix for threshold 1.12 :
## Actual Yes Actual No
## Pred Yes 1696 4736
## Pred No 155 9237
## Confusion Matrix for threshold 1.13 :
## Actual Yes Actual No
## Pred Yes 1670 4292
## Pred No 181 9681
## Confusion Matrix for threshold 1.14 :
## Actual Yes Actual No
## Pred Yes 1641 3894
## Pred No 210 10079
## Confusion Matrix for threshold 1.15 :
## Actual Yes Actual No
## Pred Yes 1602 3505
## Pred No 249 10468
## Confusion Matrix for threshold 1.16 :
## Actual Yes Actual No
## Pred Yes 1558 3151
## Pred No 293 10822
## Confusion Matrix for threshold 1.17 :
## Actual Yes Actual No
## Pred Yes 1519 2823
## Pred No 332 11150
## Confusion Matrix for threshold 1.18 :
## Actual Yes Actual No
## Pred Yes 1466 2534
## Pred No 385 11439
## Confusion Matrix for threshold 1.19 :
## Actual Yes Actual No
## Pred Yes 1402 2263
## Pred No 449 11710
## Confusion Matrix for threshold 1.2 :
## Actual Yes Actual No
## Pred Yes 1363 2065
## Pred No 488 11908
Threshold as a function of Total_Revenue and
Total_Count_of_Credits. Interpret this.# Create data frame for plotting
plot_data <- data.frame(Threshold = thresholds, Total_Revenue = benefits, Total_Count_of_Credits = credits)
# Plot Threshold vs. Total Revenue
ggplot(plot_data, aes(x = Threshold, y = Total_Revenue)) + geom_line() + ggtitle("Threshold vs. Total Revenue")
# Plot Threshold vs. Total Count of Credits
ggplot(plot_data, aes(x = Threshold, y = Total_Count_of_Credits)) + geom_line() + ggtitle("Threshold vs. Total Count of Credits")
Threshold vs. Total Revenue: The graph reveals a distinct pattern where the Total Revenue increases as the Threshold rises, reaching its peak value at a Threshold of 1.12. Beyond this point, as the Threshold continues to increase, there is a subsequent decrease in Total Revenue. As the objective is maximising the revenue, our analysis suggests setting the threshold at 1.12 is most beneficial.
Threshold vs. Total Count of Credits: The total Count of Credits remains constant at a value of 1851, irrespective of changes in the Threshold. This indicates that the Total Count of Credits is not influenced by variations in the Threshold value. But as we see from the confusion matrix above, redistributing the credit across TP, TN, FP, and FN can have a significant impact on the revenue, which can be crucial in our case.
Total_Revenue
and Total_Count_of_Credits allocated for 2 categories.
50%_Threshold and your Optimal_Threshold.optimal_threshold <- thresholds[which.max(benefits)]
results <- data.frame(
Category = c("50% Threshold", "Optimal Threshold"),
Threshold = c(0.5, optimal_threshold),
Total_Revenue = c(benefits[which(thresholds == 0.5)], max(benefits)),
Total_Count_of_Credits = c(credits[which(thresholds == 0.5)], credits[which.max(benefits)])
)
print(results)
## Category Threshold Total_Revenue Total_Count_of_Credits
## 1 50% Threshold 0.50 67812600 1851
## 2 Optimal Threshold 1.12 85124800 1851
The engineered model shows significant promise in properly redistributing the home repair tax credit program in Emil City. The potential for increased revenue suggests a positive impact. However, before fully deploying the model in production, it is crucial to validate its results, understand the implications of the chosen threshold, and ensure that outreach strategies align with the model’s predictions. The ultimate goal is to balance increased revenue with community benefit, ensuring that homeowners genuinely in need are reached and can benefit from the program, as well as setting the right policies and governance to ensure that eligible applicants are those who deserve to benefit, as we might know there will always be some loopholes in any system, and to ensure that the marketing materials results in a better response rate, we might need to have regular validations and insurance mechanisms that the system is not abused.