Project 5: Model Evaluation and Deployment

Notes on compiling this document:

Change the information in the YAML header above: title and author.
Make sure the output argument is set correctly. It should read: output: html_document or output: word_document.
Once you are finished writing the code necessary to answer the questions in the quiz, clear your environment by clicking on the broom icon in the environment pane (upper right quadrant).
Run each code chunk individually (click the green arrow icon in the upper right of the chunk). Start at the top of this document and proceed sequentially to the bottom. Fix any code errors you find.
Once your code is error-free, click “knit” in the menu above. Your document should compile to HTML if the output is set to “html_document” (or to Word if the output is set to “word_document”).

In the code chunk above (entitled “setup”) echo is set to TRUE. This means that the code in your chunks will be displayed, along with the results, in your compiled document.

Load and Transform Data

Below is code to clean and prepare the data set for modeling. Before running that code, follow these preparatory steps:

Download the RMarkdown template and the data sets for the assignment from Canvas.
Copy or move these files from your downloads folder to a folder dedicated to this class (say, MKTG-6487).
You need to define this folder as your “working directory.” To do so, navigate to that folder using the Files tab in the lower right quadrant in RStudio. (You should see the files you moved into this folder in the previous step.) Click the “More” button in the menu under the Files tab and select “Set As Working Directory.”

Once the files are in the right location on your computer, run this code to clean and format the data:

# You must run this code to format the data set properly!
library(tidyverse)  # Provides read_csv(), select(), filter(), mutate()

advise_invest <- read_csv("~/MBA/MKTG 6487/Project/adviseinvest (1).csv", show_col_types = FALSE)  |>            # Read the data from disk
  select(-product) |>                                        # Remove the product column
  filter(income > 0,                                          # Filter out mistaken data
         num_accts < 5) |> 
  mutate(answered = factor(ifelse(answered==0, "no","yes"),   # Turn answered into yes/no factor
                           levels  = c("yes", "no")),
         female = factor(female),                             # Make categorical variables into factors
         job = factor(job),
         rent = factor(rent),
         own_res = factor(own_res),
         new_car = factor(new_car),
         mobile = factor(mobile),
         chk_acct = factor(chk_acct),
         sav_acct = factor(sav_acct))

And here is code to load the data set of prospective customers from your working directory. Note that in order to use this data set for prediction, the variables need to be formatted exactly the same as in the data used to fit the model. It does not include a target variable because the event of answering or not answering has not happened yet for scheduled customers.

prospective <- read_csv("~/MBA/MKTG 6487/Project/adviseinvest_new_customer.csv", show_col_types = FALSE) |> 
  mutate(female = factor(female),
         job = factor(job),
         rent = factor(rent),
         own_res = factor(own_res),
         new_car = factor(new_car),
         mobile = factor(mobile),
         chk_acct = factor(chk_acct),
         sav_acct = factor(sav_acct))
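
Because prediction requires the new data to be formatted like the training data, it can help to check that no factor in the scoring set carries a level the model never saw. The following is a minimal sketch on toy data, not part of the assignment template; the helper `unseen_levels()` and the data frames `train` and `new` are hypothetical names introduced for illustration.

```r
# Toy stand-ins for the training and scoring data (hypothetical values)
train <- data.frame(job = factor(c(0, 1, 2, 3)), income = c(10, 20, 30, 40))
new   <- data.frame(job = factor(c(1, 2, 5)),    income = c(15, 25, 35))

# For each factor column shared by both data frames, list the levels
# that appear in `new` but not in `train`
unseen_levels <- function(train, new) {
  shared <- intersect(names(train), names(new))
  factor_cols <- shared[vapply(train[shared], is.factor, logical(1))]
  out <- lapply(factor_cols, function(col) {
    setdiff(levels(new[[col]]), levels(train[[col]]))
  })
  names(out) <- factor_cols
  out[lengths(out) > 0]  # keep only columns with a mismatch
}

unseen_levels(train, new)
```

With the real data, the same call would be `unseen_levels(advise_invest, prospective)`; an empty list would suggest the factor levels are compatible.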

Questions

Read the instructions for this phase of the project on Canvas.

Q1

# Fit the tree model using all available predictors
library(rpart)  # Provides rpart() for classification trees
tree_model <- rpart(answered ~ ., data = advise_invest, method = "class")

# Predict class labels for the training data using the default 0.5 probability threshold
predicted_classes <- predict(tree_model, newdata = advise_invest, type = "class")

# Create a confusion matrix comparing predicted classes to actual labels
confusion_matrix <- table(Predicted = predicted_classes, Actual = advise_invest$answered)
print("Confusion Matrix:")

## [1] "Confusion Matrix:"

print(confusion_matrix)

##          Actual
## Predicted   yes    no
##       yes 13820  3008
##       no   2304 10367
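
The counts in this table can be summarized as accuracy, sensitivity, and precision. The sketch below rebuilds the matrix from the printed output so that it runs on its own; the object name `cm` and the metric variable names are illustrative. Since the predictions were made on the same data used to fit the tree, these are in-sample figures and are likely to be optimistic.

```r
# Rebuild the confusion matrix from the printed counts (filled column by column)
cm <- matrix(c(13820, 2304, 3008, 10367), nrow = 2,
             dimnames = list(Predicted = c("yes", "no"),
                             Actual    = c("yes", "no")))

# Share of all cases classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual "yes" cases that were predicted "yes"
sensitivity <- cm["yes", "yes"] / sum(cm[, "yes"])

# Share of predicted "yes" cases that were actually "yes"
precision <- cm["yes", "yes"] / sum(cm["yes", ])

# Accuracy of always predicting the more common class, for comparison
majority_baseline <- max(colSums(cm)) / sum(cm)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        precision = precision, baseline = majority_baseline), 3)
```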

Q2

# Compute counts from the confusion matrix
# True Positives (TP): predicted "yes" and observed "yes"
TP <- confusion_matrix["yes", "yes"]

# False Positives (FP): predicted "yes" but observed "no"
FP <- confusion_matrix["yes", "no"]

# True Negatives (TN): predicted "no" and observed "no"
TN <- confusion_matrix["no", "no"]

# False Negatives (FN): predicted "no" but observed "yes"
FN <- confusion_matrix["no", "yes"]


# Calculate expected profit using the cost-benefit matrix:
# Benefit: $75 for a true positive (purchase: $100 benefit minus $25 agent cost)
# Cost: -$25 for a false positive (agent cost incurred when customer does not answer)
profit <- (TP * 75) + (FP * (-25))
# Ensure that profit is not negative (if negative, set to 0)
profit <- max(profit, 0)

cat("Expected Profit: $", profit, "\n")

## Expected Profit: $ 961300
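
To put this figure in context, the same cost-benefit arithmetic can be wrapped in a function and compared with a policy of calling every customer. This is a sketch under the payoffs stated above ($75 per true positive, -$25 per false positive); the function `expected_profit()` and the "call everyone" comparison are illustrative additions, not part of the assignment.

```r
# Confusion matrix rebuilt from the Q1 output (rows = predicted, columns = actual)
cm <- matrix(c(13820, 2304, 3008, 10367), nrow = 2,
             dimnames = list(Predicted = c("yes", "no"),
                             Actual    = c("yes", "no")))

# Profit from calling everyone predicted "yes", floored at zero
expected_profit <- function(cm, benefit_tp = 75, cost_fp = -25) {
  tp <- cm["yes", "yes"]
  fp <- cm["yes", "no"]
  max(tp * benefit_tp + fp * cost_fp, 0)
}

profit_model <- expected_profit(cm)

# Calling everyone: every actual "yes" is a true positive,
# every actual "no" is a false positive
profit_call_all <- max(sum(cm[, "yes"]) * 75 + sum(cm[, "no"]) * -25, 0)

c(model = profit_model, call_everyone = profit_call_all)
```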

Q3

# Predict probabilities for prospective customers using the tree model
predicted_probs <- predict(tree_model, newdata = prospective, type = "prob")[, "yes"]

# Assign a predicted class label of "yes" if the probability is >= 0.3
prospective <- prospective %>% 
  mutate(prob_answer = predicted_probs,
         pred_answer = if_else(prob_answer >= 0.3, "yes", "no"))

# Create contact list: filter only those predicted "yes"
contact_list <- prospective %>% 
  filter(pred_answer == "yes")

# Number of customers on the contact list
num_contacts <- nrow(contact_list)
cat("Number of Prospective Customers on the Contact List:", num_contacts, "\n")

## Number of Prospective Customers on the Contact List: 624
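
The size of the contact list depends directly on the chosen cutoff. The sketch below uses a handful of made-up probabilities (not the model's actual output) to show how the count changes between thresholds; `probs` and `contacts_at()` are hypothetical names.

```r
# Hypothetical predicted probabilities of answering
probs <- c(0.10, 0.28, 0.30, 0.45, 0.80)

# Number of customers whose probability meets or exceeds the threshold
contacts_at <- function(probs, threshold) sum(probs >= threshold)

# Compare the 0.3 cutoff used above with the default 0.5 cutoff
list_sizes <- sapply(c(0.3, 0.5), contacts_at, probs = probs)
list_sizes
```

With the real data, `contacts_at(predicted_probs, 0.3)` should match `num_contacts`.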

Q4

#The analysis suggests that calling only prospective customers whose model-estimated probability of answering is 30% or higher improves both operational efficiency and profitability. Under the cost-benefit matrix, a call that the customer answers yields a net profit of $75 (a $100 benefit minus the $25 agent cost). Conversely, calling a customer who does not answer costs $25.

#Based on the cost-benefit analysis, filtering the contact list to include only those customers with at least a 30% chance of answering reduces the cost incurred on calls that are unlikely to result in a sale. With this threshold in place, sales representatives are assigned only to prospective customers with a higher likelihood of engagement, which makes better use of staff and reduces idle time.

#Operationalizing this approach involves integrating the model into the customer scheduling process. For every prospective customer, compute the probability of an answered call. The system should automatically flag those individuals with predicted probabilities of 30% or higher for follow-up calls. In addition, continual monitoring of model performance and periodic recalibration of the threshold as more data become available can further refine the approach, keeping the sales team focused on high-probability engagements and improving overall profitability.
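
The payoffs quoted above imply a break-even answer probability, which gives a quick sanity check on the 30% threshold. This is a minimal sketch that assumes only the $75 and $25 figures from the cost-benefit matrix; the variable and function names are illustrative.

```r
benefit_answered <- 75  # net profit when the customer answers
cost_unanswered  <- 25  # agent cost when the customer does not answer

# Expected value of a single call at answer probability p
expected_value <- function(p) {
  p * benefit_answered - (1 - p) * cost_unanswered
}

# Probability at which the expected value of a call is zero
break_even <- cost_unanswered / (benefit_answered + cost_unanswered)

break_even
expected_value(0.30)
```

Under these assumptions the break-even probability is 0.25, so a 0.30 cutoff keeps only calls with a positive expected value.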

Dustin Schouten
