Notes on compiling this document:
In the code chunk above (entitled “setup”) echo is set to TRUE. This means that the code in your chunks will be displayed, along with the results, in your compiled document.
Below is code to clean and prepare the data set for modeling. Before running that code, follow these preparatory steps:
Download the RMarkdown template and the data sets for the assignment from Canvas.
Copy or move these files from your downloads folder to a folder dedicated to this class–say, MKTG-6487.
You need to define this folder as your “working directory.” To do so, navigate to that folder using the files tab in the lower right quadrant in RStudio. (You should see your files you moved into this folder in the previous step.) Click the “More” button in the menu under the Files tab and select “Set As Working Directory.”
Once the files are in the right location on your computer then run this code to clean and format the data:
# You must run this code to format the data set properly!
advise_invest <- read_csv("adviseinvest.csv") %>% # Download data
select(-product) %>% # Remove the product column
filter(income > 0, # Filter out mistaken data
num_accts < 5) %>%
mutate(answered = factor(ifelse(answered==0, "no","yes"), # Turn answered into yes/no factor
levels = c("no", "yes")),
female = factor(female), # Make categorical variables into factors
job = factor(job),
rent = factor(rent),
own_res = factor(own_res),
new_car = factor(new_car),
mobile = factor(mobile),
chk_acct = factor(chk_acct),
sav_acct = factor(sav_acct))
## Rows: 29502 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): answered, income, female, age, job, num_dependents, rent, own_res,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
And here is code to load the data set of prospective customers from your working directory. Note that in order to use this data set for prediction, the variables need to be formatted exactly the same as in the data used to fit the model. It does not include a target variable because the event of answering or not answering has not happened yet for scheduled customers.
prospective <- read_csv("customer_data.csv") %>%
mutate(female = factor(female),
job = factor(job),
rent = factor(rent),
own_res = factor(own_res),
new_car = factor(new_car),
mobile = factor(mobile),
chk_acct = factor(chk_acct),
sav_acct = factor(sav_acct))
## Rows: 1000 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): customer_id
## dbl (12): income, female, age, job, num_dependents, rent, own_res, new_car, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
One of the simplifying assumptions we will make in this project is that all the customers who answer the phone will purchase a product. (This assumption is actually verified by the data.) To model “answered” in this case is therefore equivalent to modeling “purchased.”
There are costs and benefits in this case. We will assume that customers purchase a product for $100 dollars. This was the average cost of AdviseInvest products, according to the Director of Sales. Also, as we learned in the interview, the agent time to make the sale is worth $25. Profit would therefore be $75 dollars for an answered call and a purchase. In sum:
Benefit: True positive. The customer is predicted to answer, does answer, and purchases a product for $100 for a profit of 100 - 25 = $75.
Cost: False positive. The customer is predicted to answer, but does not answer, so there is a loss of $25. (We assume the agent cannot schedule another call at the last minute, or spends the entire time slot trying to make the call.)
For this exercise, we propose that customers who are not predicted to answer will not be called, so there would be no benefits and no costs for them.
However, this proposal is for illustration only. Below you will be asked to come up with a final recommendation for the Director of Sales, and you should feel free to craft a solution—whatever that might be—that fits the details of the case.
One thing to keep in mind for this final phase of the project is that a predictive model is always developed using historical data. The end goal, however, is to predict the future occurrence of the event that has been modeled. In this exercise, you will practice using data on new customers—that is, customers who have not yet been called—to predict whether they will answer. How you use these predictions in solving the business problem is up to you.
#fit tree
tree_model <- rpart(answered ~ .,
data = advise_invest)
tree_model
## n= 29499
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 29499 13375 yes (0.4534052 0.5465948)
## 2) chk_acct=0,1,2 19199 8000 no (0.5833116 0.4166884)
## 4) income>=79840 1728 192 no (0.8888889 0.1111111) *
## 5) income< 79840 17471 7808 no (0.5530880 0.4469120)
## 10) mobile=0 16383 6848 no (0.5820057 0.4179943)
## 20) sav_acct=0,1 13311 4928 no (0.6297799 0.3702201)
## 40) income>=5900 13055 4672 no (0.6421295 0.3578705)
## 80) age< 25.5 3071 640 no (0.7915988 0.2084012)
## 160) chk_acct=0,1 2879 448 no (0.8443904 0.1556096) *
## 161) chk_acct=2 192 0 yes (0.0000000 1.0000000) *
## 81) age>=25.5 9984 4032 no (0.5961538 0.4038462)
## 162) income< 10310 1280 256 no (0.8000000 0.2000000) *
## 163) income>=10310 8704 3776 no (0.5661765 0.4338235)
## 326) income>=11000 8448 3520 no (0.5833333 0.4166667)
## 652) age>=27.5 7424 2880 no (0.6120690 0.3879310)
## 1304) income< 13875 1152 192 no (0.8333333 0.1666667) *
## 1305) income>=13875 6272 2688 no (0.5714286 0.4285714)
## 2610) income>=17960 5568 2176 no (0.6091954 0.3908046)
## 5220) female=1 768 128 no (0.8333333 0.1666667) *
## 5221) female=0 4800 2048 no (0.5733333 0.4266667)
## 10442) job=2 2944 1024 no (0.6521739 0.3478261)
## 20884) income>=26300 1984 448 no (0.7741935 0.2258065) *
## 20885) income< 26300 960 384 yes (0.4000000 0.6000000)
## 41770) new_car=1 384 64 no (0.8333333 0.1666667) *
## 41771) new_car=0 576 64 yes (0.1111111 0.8888889) *
## 10443) job=0,1,3 1856 832 yes (0.4482759 0.5517241)
## 20886) income< 21525 384 64 no (0.8333333 0.1666667) *
## 20887) income>=21525 1472 512 yes (0.3478261 0.6521739) *
## 2611) income< 17960 704 192 yes (0.2727273 0.7272727) *
## 653) age< 27.5 1024 384 yes (0.3750000 0.6250000) *
## 327) income< 11000 256 0 yes (0.0000000 1.0000000) *
## 41) income< 5900 256 0 yes (0.0000000 1.0000000) *
## 21) sav_acct=2,3,4 3072 1152 yes (0.3750000 0.6250000)
## 42) age< 34 1856 896 yes (0.4827586 0.5172414)
## 84) num_accts>=1.5 1216 448 no (0.6315789 0.3684211) *
## 85) num_accts< 1.5 640 128 yes (0.2000000 0.8000000) *
## 43) age>=34 1216 256 yes (0.2105263 0.7894737) *
## 11) mobile=1 1088 128 yes (0.1176471 0.8823529) *
## 3) chk_acct=3 10300 2176 yes (0.2112621 0.7887379)
## 6) income>=38910 2240 1088 no (0.5142857 0.4857143)
## 12) num_accts< 1.5 448 0 no (1.0000000 0.0000000) *
## 13) num_accts>=1.5 1792 704 yes (0.3928571 0.6071429)
## 26) new_car=1 448 64 no (0.8571429 0.1428571) *
## 27) new_car=0 1344 320 yes (0.2380952 0.7619048) *
## 7) income< 38910 8060 1024 yes (0.1270471 0.8729529) *
#visualize
rpart.plot(x = tree_model, tweak = 1.5, roundint = T)
#confusion matrix
table(predicted = predict(tree_model, type = "class"),
observed = advise_invest$answered)
## observed
## predicted no yes
## no 10367 2304
## yes 3008 13820
#calculate profit
13820*75-25*3008
## [1] 961300
# Predict probability -- produces two columns
predict(tree_model, type = "prob")[, 1] %>%
head
## 1 2 3 4 5 6
## 0.1176471 0.1270471 0.1270471 0.8333333 0.8333333 0.2727273
table(predicted = ifelse(predict(tree_model, type = "prob")[ , 1] > .3, "no", "yes"),
observed = advise_invest$answered)
## observed
## predicted no yes
## no 11263 3904
## yes 2112 12220
#calculate profit
12220*75-25*3904
## [1] 818900
glimpse(prospective)
## Rows: 1,000
## Columns: 13
## $ income <dbl> 30010, 30510, 12880, 15530, 46230, 17430, 53020, 27820,…
## $ female <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ age <dbl> 40, 54, 48, 44, 40, 34, 36, 31, 30, 29, 24, 22, 29, 34,…
## $ job <fct> 2, 2, 2, 2, 3, 1, 3, 3, 2, 3, 2, 2, 3, 2, 2, 2, 2, 3, 2…
## $ num_dependents <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ rent <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0…
## $ own_res <fct> 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0…
## $ new_car <fct> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1…
## $ chk_acct <fct> 1, 0, 0, 0, 3, 3, 0, 3, 3, 3, 1, 0, 3, 3, 0, 3, 0, 2, 3…
## $ sav_acct <fct> 0, 0, 1, 0, 1, 1, 0, 2, 0, 0, 0, 0, 1, 0, 0, 4, 1, 0, 0…
## $ num_accts <dbl> 3, 2, 4, 2, 2, 2, 4, 3, 2, 2, 2, 1, 2, 0, 2, 2, 3, 4, 4…
## $ mobile <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ customer_id <chr> "H1597", "P1446", "E1492", "W5143", "W4927", "M6161", "…
#predict probabilities
predict(tree_model,
newdata = prospective,
type = "prob") %>%
head()
## no yes
## 1 0.7741935 0.2258065
## 2 0.7741935 0.2258065
## 3 0.8333333 0.1666667
## 4 0.2727273 0.7272727
## 5 0.2380952 0.7619048
## 6 0.1270471 0.8729529
#new df
predictions <- prospective %>%
select(customer_id) %>%
mutate(answered_prob = predict(tree_model,
newdata = prospective,
type = "prob")[,2])
head(predictions)
## # A tibble: 6 × 2
## customer_id answered_prob
## <chr> <dbl>
## 1 H1597 0.226
## 2 P1446 0.226
## 3 E1492 0.167
## 4 W5143 0.727
## 5 W4927 0.762
## 6 M6161 0.873
#contact list
contact_list <- predictions %>%
filter(answered_prob >= .3) %>%
arrange(desc(answered_prob))
glimpse(contact_list)
## Rows: 624
## Columns: 2
## $ customer_id <chr> "M8502", "V4331", "S3197", "S6624", "C1491", "W3586", "Y…
## $ answered_prob <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0000000, 1.0000000, 1…