Course: MBA563
# load needed packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Confusion Matrix function
my_confusion_matrix <- function(cf_table) {
  # assumes a 2x2 table built as table(predicted, actual) with levels ordered low, high
  true_positive <- cf_table[4]    # predicted high, actually high
  true_negative <- cf_table[1]    # predicted low, actually low
  false_positive <- cf_table[2]   # predicted high, actually low
  false_negative <- cf_table[3]   # predicted low, actually high
  # standard accuracy measures computed from the four counts
  accuracy <- (true_positive + true_negative) / (true_positive + true_negative + false_positive + false_negative)
  sensitivity_recall <- true_positive / (true_positive + false_negative)
  specificity_selectivity <- true_negative / (true_negative + false_positive)
  precision <- true_positive / (true_positive + false_positive)
  neg_pred_value <- true_negative / (true_negative + false_negative)
  print(cf_table)
  my_list <- list(sprintf("%1.0f = True Positive (TP), Hit", true_positive),
                  sprintf("%1.0f = True Negative (TN), Rejection", true_negative),
                  sprintf("%1.0f = False Positive (FP), Type 1 Error", false_positive),
                  sprintf("%1.0f = False Negative (FN), Type 2 Error", false_negative),
                  sprintf("%1.4f = Accuracy (TP+TN/(TP+TN+FP+FN))", accuracy),
                  sprintf("%1.4f = Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN))", sensitivity_recall),
                  sprintf("%1.4f = Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP))", specificity_selectivity),
                  sprintf("%1.4f = Precision, Positive Predictive Value (How good are the model's positive predictions? TP/(TP+FP))", precision),
                  sprintf("%1.4f = Negative Predictive Value (How good are the model's negative predictions? TN/(TN+FN))", neg_pred_value)
  )
  return(my_list)
}
We will now use the decision tree algorithm to address our business problem.
TECA is planning for the future and would like to set up their business so they are not so reliant on selling gasoline. One way to do that is to increase the sales of profitable products. Thus, TECA would like to predict when a transaction is going to be a sale of a high gross profit margin product. Gross profit margin is gross profit as a percentage of revenue, that is, revenue minus cost divided by revenue. It tells us what share of each dollar of revenue a product earns as profit.
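For instance, here is a quick illustration of that formula with made-up numbers (the prices and costs below are hypothetical, not TECA data): an item that sells for $2.00 and costs $0.60 has a gross profit margin of 70%.
# hypothetical example: gross profit margin = (revenue - cost) / revenue
revenue_example <- 2.00   # selling price of the item (made up)
cost_example <- 0.60      # cost of the item (made up)
(revenue_example - cost_example) / revenue_example   # 0.70, i.e., a 70% margin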
The first thing we need to do is bring in the dataset we are going to work with. This is TECA data that has been transformed to be used in our decision tree analysis. Let’s load it and examine it a bit.
We first notice that the dataset has 200,000 rows and 13 variables. The data is aggregated at the level of a purchase, so each example, or row, is the purchase of one item.
The first feature is the target feature, or dependent variable. This is the variable we are trying to predict. It is called high_gpm. It is a factor that is either low or high.
The second variable is a categorical variable, loyalty2. This feature indicates whether the purchase was made by a loyalty customer (a customer who scanned their loyalty card) or not. Note two things here. First, the data was adjusted so that about half of the purchases in this dataset were made by loyalty customers, since TECA is particularly interested in these customers. Second, the decision tree method does not require us to make all of our variables numerical, so we do not have to convert our character/categorical features to multiple columns of dummy variables.
Next we see revenue. This variable is the amount of revenue TECA makes on each purchase. Revenue is a continuous variable.
Next we have the quarter feature. This variable indicates the quarter of the year in which the purchase took place.
Next, income, bachelors_degree, and population list these measures for the location of the store where the purchase took place.
Next, we have another multi-level factor variable, state_province, which lists the state in which the purchase happened.
Next, we have num_trans, which indicates how many purchases were made by the same person as part of this transaction.
Next, basket indicates whether the purchase was part of a multi-part transaction.
Next, refill indicates whether or not the purchase was a refill of fountain soda.
area is a multi-level categorical variable that indicates the area of the store in which the purchase was made, as follows:
* alcohol: products in this area include, for example, wine and beer;
* cooler: products in this area include, for example, energy drinks, canned soda, and juice;
* dispensed: products in this area include, for example, cold and hot dispensed (fountain) drinks;
* fresh: products in this area include, for example, pizza, hot sandwiches, salads, and roller grill items;
* fuel: products in this area include, for example, gas;
* grocery: products in this area include, for example, milk, eggs, and cheese;
* lottery: products in this area include, for example, lottery tickets;
* miscellaneous: products in this area include, for example, store services and coupons;
* nongrocery: products in this area include, for example, clothing, magazines, medicine, and newspapers;
* snacks: products in this area include, for example, candy, gum, chips, and salty snacks; and
* tobacco: products in this area include, for example, cigarettes and chewing tobacco.
Finally, items_sold indicates how many of these items were purchased in this transaction.
tree_input <- read_rds('tree_input.rds')
str(tree_input)
## 'data.frame': 200000 obs. of 13 variables:
## $ high_gpm : Factor w/ 2 levels "low","high": 1 1 1 1 1 2 2 1 2 1 ...
## $ loyalty2 : Factor w/ 2 levels "not loyal","loyal": 1 1 1 1 1 1 1 1 1 1 ...
## $ revenue : num 6.35 1.33 5 5.74 6.01 ...
## $ quarter : int 3 1 3 2 2 2 4 3 3 1 ...
## $ income : int 43438 58594 59468 38516 52379 61115 69084 44112 54614 65861 ...
## $ bachelors_degree: int 51 1167 3808 665 4088 264 8337 1649 1739 164 ...
## $ population : int 279 8624 22756 9515 41584 2048 51804 19064 23523 1730 ...
## $ state_province : Factor w/ 10 levels "Wyoming","Iowa",..: 5 8 5 3 6 3 4 4 4 2 ...
## $ num_trans : int 1 1 1 1 1 1 1 1 1 1 ...
## $ basket : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ refill : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ area : Factor w/ 11 levels "fuel","alcohol",..: 5 2 10 5 5 8 8 2 7 1 ...
## $ items_sold : num 1 1 1 1 1 1 1 1 1 1 ...
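Note that str() abbreviates the factor levels. If you want to see the complete set of levels for each categorical feature, one quick (optional) way is the line below, which simply loops over the factor columns:
# list all levels of each factor column (str() truncates them)
sapply(tree_input[sapply(tree_input, is.factor)], levels)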
slice_sample(tree_input, n=10)
## high_gpm loyalty2 revenue quarter income bachelors_degree population
## 1595146 high not loyal 1.29 3 37695 713 11159
## 549815 high loyal 1.00 4 40511 1030 15262
## 279497 high loyal 5.00 4 66160 10178 57364
## 205407 high loyal 1.19 2 50473 589 7344
## 430062 low loyal 10.00 2 37755 6309 47090
## 529195 high loyal 3.38 1 58386 7602 39333
## 523472 high loyal 1.69 2 37917 10 546
## 1563726 high not loyal 1.34 1 57946 32 237
## 212536 high loyal 2.59 2 40576 554 6646
## 293694 high not loyal 2.69 3 65625 207 2270
## state_province num_trans basket refill area items_sold
## 1595146 Alabama 1 no no dispensed 1
## 549815 Alabama 1 no no dispensed 1
## 279497 Missouri 1 no no cooler 3
## 205407 Alabama 1 no no snacks 1
## 430062 Missouri 1 no no fuel 1
## 529195 Colorado 1 no no dispensed 2
## 523472 Arkansas 1 no no dispensed 1
## 1563726 Iowa 1 no no fresh 1
## 212536 Missouri 1 no no cooler 1
## 293694 Iowa 1 no no nongrocery 1
Let’s look more in depth at the target feature. For tree analysis, the target feature is a categorical variable, and in this implementation we can leave it as a factor. About 43% of these purchases are for high gross profit margin items.
freq <- table(tree_input$high_gpm)
freq[2]/(freq[1]+freq[2])
## high
## 0.4341
contrasts(tree_input$high_gpm)
## high
## low 0
## high 1
Before using the algorithm, we need to prepare the data. The first line below loads the caret package. The next line sets the seed for the randomization that will be used for the data split. The caret::createDataPartition() function splits the data: it randomly selects row numbers for p = 0.75 of the observations, sampling within each level of the target feature so that the class mix is preserved, and returns those row numbers in a matrix (since list is FALSE). Next, the training and testing data are created from the row numbers in the partition matrix we just made. data_train keeps the rows whose numbers appear in partition, while data_test takes the rows that do not, i.e., -partition. We use the set.seed() function with the same number every time so that we get a reproducible result each time we run this code, since the function relies on R’s random number generator. Of course, your results might not exactly replicate mine if your R version is different, etc.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(77)
partition <- caret::createDataPartition(y=tree_input$high_gpm, p=.75, list=FALSE)
data_train <- tree_input[partition, ]
data_test <- tree_input[-partition, ]
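If you want to confirm that the split behaved as described, a quick optional check (not part of the original analysis) is to compare the row counts and class proportions; because createDataPartition() samples within each level of high_gpm, the training and testing data should show roughly the same mix as the full dataset:
# sanity check on the stratified 75/25 split (optional)
nrow(data_train) / nrow(tree_input)        # should be close to 0.75
prop.table(table(data_train$high_gpm))     # class mix in the training data
prop.table(table(data_test$high_gpm))      # class mix in the testing data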
For decision trees, which don’t rely on distances to calculate splits, we do not need to standardize our data, so we are ready to run our model. The tree() function creates our tree object. The first argument in the function is our formula. We are predicting the target variable, high_gpm, on the left-hand side of the ~, and using all of our other features to predict it, as indicated by the .. We do this, of course, with our training data. As usual, we hold back our test data to evaluate our model later.
#install.packages('tree')
library(tree)
model_tree <- tree(high_gpm ~ ., data_train)
First, before we explore the tree and find out what it tells us, let’s see how accurate our model is. If this model is going to help TECA make decisions, it needs to be accurate at predicting whether a purchase will be for a high gross profit margin product or not.
The first step in this process is using the model we trained on our training data to see how well it predicts on our holdout testing data. The predict() function from the tree package does this for us. Notice that the testing data is passed as the second argument. The type='class' argument tells R to give the output as the actual class labels, low and high in our case.
predict_tree <- predict(model_tree, data_test, type='class')
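As a side note, if you leave out type='class', predict() for a classification tree returns class probabilities rather than labels (one column per level of high_gpm). A quick look, purely for illustration:
# class probabilities instead of labels (the default output for a classification tree)
head(predict(model_tree, data_test))   # columns "low" and "high" sum to 1 for each row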
Next, let’s use the confusion matrix function created at the beginning of the notebook. Overall, we find that our model is very accurate. The model makes the correct prediction about 85% of the time!
Let’s explore some details of the accuracy of this model. The table shows the following output:
* When a transaction is actually a high gross profit margin transaction, the model correctly classifies it as such, by saying “high”, 16186 times. This is called a True Positive (TP), Hit.
* When a transaction is actually a low gross profit margin transaction, the model correctly classifies it as such, by saying “low”, 26281 times. This is a True Negative (TN), Rejection.
* On the other hand, the model makes two kinds of errors. When a transaction is not a high gross profit margin transaction, but the model incorrectly says it is, this is called a False Positive (FP), Type 1 Error. It happens 2014 times.
* Finally, when a transaction is a high gross profit margin transaction and the model says it is not, which here happens 5519 times, it is called a False Negative (FN), Type 2 Error.
These numbers can then be manipulated to create different measures of accuracy, as follows:
* Overall accuracy (TP+TN/(TP+TN+FP+FN)) is 0.8493.
* Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN)) is 0.7457.
* Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP)) is 0.9288.
* Precision, Positive Predictive Value (How good are the model’s positive predictions? TP/(TP+FP)) is 0.8893.
* Negative Predictive Value (How good are the model’s negative predictions? TN/(TN+FN)) is 0.8264.
Overall, our model does a great job of predicting whether a purchase will be high or low profit margin, getting it right about 85% of the time. The least effective aspect of the model is its sensitivity, and the most effective is its specificity. This means the model is very good at classifying the low margin purchases correctly (specificity), but somewhat worse at classifying the high margin purchases correctly (sensitivity). Thus, while the model is quite good overall, it particularly excels at identifying the low margin transactions. As we examine the tree below, it should become clear why low margin products are a bit easier to classify.
table1 <- table(predict_tree, data_test$high_gpm)
my_confusion_matrix(table1)
##
## predict_tree low high
## low 26281 5519
## high 2014 16186
## [[1]]
## [1] "16186 = True Positive (TP), Hit"
##
## [[2]]
## [1] "26281 = True Negative (TN), Rejection"
##
## [[3]]
## [1] "2014 = False Positive (FP), Type 1 Error"
##
## [[4]]
## [1] "5519 = False Negative (FN), Type 2 Error"
##
## [[5]]
## [1] "0.8493 = Accuracy (TP+TN/(TP+TN+FP+FN))"
##
## [[6]]
## [1] "0.7457 = Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN))"
##
## [[7]]
## [1] "0.9288 = Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP))"
##
## [[8]]
## [1] "0.8893 = Precision, Positive Predictive Value (How good are the model's positive predictions? TP/(TP+FP))"
##
## [[9]]
## [1] "0.8264 = Negative Predictive Value (How good are the model's negative predictions? TN/(TN+FN)"
Next, let’s examine the details of the tree we have created and what it might tell us that could help TECA increase their sales of profitable products. We can start this process very simply by using the summary() function on our tree object, model_tree. We learn a couple of very useful things here. First, we learn that only two variables were used in our tree, area and revenue. Next, we are told how many terminal nodes exist. Finally, we are given some error measurements. Residual mean deviance is a measure of the variance remaining in the model, and the misclassification error rate is a measure of how many examples were misclassified. It shows that 22,723 of the 150,000 training observations, or about 15%, were incorrectly classified (placed in the wrong partition).
summary(model_tree)
##
## Classification tree:
## tree(formula = high_gpm ~ ., data = data_train)
## Variables actually used in tree construction:
## [1] "area" "revenue"
## Number of terminal nodes: 6
## Residual mean deviance: 0.5741 = 86110 / 150000
## Misclassification error rate: 0.1515 = 22723 / 150000
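The misclassification error rate reported by summary() can also be reproduced by scoring the training data directly; this is an optional check, not part of the original workflow:
# training-set misclassification rate, computed directly
train_pred <- predict(model_tree, data_train, type = 'class')
mean(train_pred != data_train$high_gpm)   # about 0.1515, matching summary()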
Next, let’s look at the tree in text form. To do this, we simply type the name of the tree object that we just created. This is our tree, albeit in text form, and it gives us quite a lot of information. The first line is the key to reading the table, node), split, n, deviance, yval, (yprob). It says that each line starts with a numbered node, then lists the rule used to split the data, the number of observations at that node, the deviance associated with that node, the predicted class at the node, and the proportion of observations at the node in each class (low and high, in that order). This is very informative, but let’s also plot the tree so that we can get a fuller picture (literally) and discuss what the tree is telling us.
model_tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 150000 205300.00 low ( 5.659e-01 4.341e-01 )
## 2) area: fuel,alcohol,tobacco,lottery,miscellaneous 52207 64.59 low ( 9.999e-01 5.746e-05 ) *
## 3) area: snacks,fresh,grocery,cooler,dispensed,nongrocery 97793 124600.00 high ( 3.342e-01 6.658e-01 )
## 6) area: snacks,grocery,cooler 51950 71380.00 low ( 5.553e-01 4.447e-01 )
## 12) revenue < 1.495 8775 10050.00 high ( 2.597e-01 7.403e-01 ) *
## 13) revenue > 1.495 43175 57530.00 low ( 6.154e-01 3.846e-01 ) *
## 7) area: fresh,dispensed,nongrocery 45843 26360.00 high ( 8.363e-02 9.164e-01 )
## 14) area: fresh,nongrocery 22767 20640.00 high ( 1.684e-01 8.316e-01 )
## 28) revenue < 2.21 11985 14060.00 high ( 2.733e-01 7.267e-01 ) *
## 29) revenue > 2.21 10782 4397.00 high ( 5.185e-02 9.482e-01 ) *
## 15) area: dispensed 23076 0.00 high ( 0.000e+00 1.000e+00 ) *
These two functions plot the tree. The first one draws the outline and the second one adds the text. The heights of the branches are proportional to the decrease in impurity, so the longer the branch, the more helpful the split is at reducing messiness in the model. The all=TRUE option labels the interior nodes as well as the terminal nodes, cex=.75 sets the text size, font=2 makes the labels bold, digits=2 controls how many digits are shown, and pretty=0 keeps the full factor level names in the plot rather than abbreviating them.
So, what do we learn from the text above and the plot? As we move through the tree, moving to the left means “yes” (the split condition is met) while moving to the right means “no”. The “low” and “high” labels at the leaf nodes indicate the predicted profitability of the purchase.
We actually learn a lot that can help TECA. The first, and most significant, split that determines whether a purchase will be for a high or low gross profit margin product is whether the purchase is for fuel, alcohol, tobacco, lottery, or miscellaneous items. If the answer is yes, we move to the left of the tree and the product is very likely to be low profitability. While the management team at TECA surely knows which products are high and low profit, this stark result should make it clear that if profitability is the primary concern of management, it is important to 1) get customers into the store, and 2) get customers to move beyond just the traditional “vice” products. The more management can convert its stores into locations where people do more than just grab cigarettes, beer, and lottery tickets after getting gas, the more profitable the stores will be. Clearly there are other considerations for management, and these products drive traffic into the store and increase revenue, but management can use this knowledge to increase the purchase of higher margin products, perhaps even in concert with these vice products.
Let’s explore the rest of the plot. If customers instead buy snacks, fresh, grocery, cooler, dispensed, or nongrocery items, the model gets a bit more complicated. If customers buy snacks, grocery, or cooler items, the profitability of the purchase depends on revenue, such that lower revenue (a price below about $1.50) actually makes a high margin purchase more likely. This is also likely useful information for TECA: it helps them realize that even the lower priced items, from the right areas of the store, are profitable. This might lead to promotion of these products by placing them at the front of coolers or on the counter at checkout. If the customer instead purchases fresh, dispensed, or nongrocery items, the data is split again, with dispensed on one side and fresh and nongrocery on the other. Dispensed items are predicted to be high profit, so TECA should continue promoting and facilitating fountain drinks and coffee. Finally, if fresh food and nongrocery items are purchased, profitability is also predicted to be high, whether revenue is above or below $2.21.
plot(model_tree)
text(model_tree, all=TRUE, cex=.75, font=2, digits=2, pretty=0)
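One way to corroborate the tree’s first split is to tabulate the share of high margin purchases by store area using the tidyverse tools loaded at the top of the notebook; this is an optional exploratory check, not part of the model itself:
# share of high gross profit margin purchases by store area (exploratory check)
tree_input %>%
  group_by(area) %>%
  summarize(pct_high = mean(high_gpm == 'high'), n = n()) %>%
  arrange(desc(pct_high))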