Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. In other words, a receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction and each column in a row represents an item.
Here we’ve loaded the data.
I was expecting every column to be a unique item with a 1 or 0 per row depending on whether that item was purchased or not in that transaction.
What we have instead is that, for each transaction (row), the first n columns are filled, where n is the number of items purchased and each value is the name of an item purchased.
By inspection it looks like quantities aren’t considered, so whether someone bought one or 20 yogurts, it’s listed in the data as yogurt.
# Load packages
library(readr)
library(arules)
library(arulesViz)
# Import data
df <- read_csv("~/Documents/D624/HM10/GroceryDataSet.csv",
col_names = FALSE)
head(df)
## # A tibble: 6 × 32
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 citru… semi… marg… read… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 tropi… yogu… coff… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 whole… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 pip f… yogu… crea… meat… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 other… whol… cond… long… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 whole… butt… yogu… rice abra… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## # ℹ 19 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>, X18 <chr>,
## # X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>, X23 <lgl>, X24 <lgl>,
## # X25 <lgl>, X26 <lgl>, X27 <lgl>, X28 <lgl>, X29 <lgl>, X30 <lgl>,
## # X31 <lgl>, X32 <lgl>
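To make this layout concrete, here is a minimal sketch on a made-up three-row data frame. The `toy` object and its item values are illustrative assumptions, not rows from the real file. Counting the non-missing cells in each row gives the number of items on each receipt.

```r
# Toy stand-in for the wide layout: one row per receipt, items left-aligned,
# unused columns filled with NA (values are made up for illustration)
toy <- data.frame(
  X1 = c("citrus fruit", "whole milk", "yogurt"),
  X2 = c("margarine", NA, "coffee"),
  X3 = c("ready soups", NA, NA),
  stringsAsFactors = FALSE
)

# Number of items on each receipt = number of non-NA cells in the row
basket_size <- rowSums(!is.na(toy))
print(basket_size)

# Distinct item names across all receipts
all_cells <- unlist(toy, use.names = FALSE)
unique_items <- sort(unique(all_cells[!is.na(all_cells)]))
print(unique_items)
```

The same two lines could be pointed at `df` to see how basket sizes are distributed in the real data.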
Here are some foundational concepts to understand market basket analysis.
An itemset is a set of one or more items that appear together in a transaction. For example, if both bread and butter are purchased in a given transaction, then bread & butter is an itemset in that transaction. The full list of items in the transaction is also an itemset, so bread & butter could be considered an item subset, but we’ll still use the term itemset.
Support is the percent of transactions that the itemset appears in. For example, if bread & butter appear together in 10 of 100 transactions, the support is 10%.
Confidence is how often the rule is true. For example, if butter appears in 20% of transactions and bread & butter together have a support of 10%, then the confidence of butter -> bread is 10/20, or 50%.
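Support and confidence can be checked by hand in base R. This is a minimal sketch under stated assumptions: the ten baskets are made up so that the percentages match the example (butter in 20% of baskets, bread & butter together in 10%), and `support()` and `confidence()` are illustrative helper names, not arules functions.

```r
# Ten made-up baskets chosen so the shares match the running example:
# butter in 2 of 10 baskets, bread & butter together in 1 of 10
baskets <- list(
  c("bread", "butter"), c("butter", "milk"), c("bread", "milk"),
  c("bread"), c("bread", "sugar"), c("milk"), c("sugar", "milk"),
  c("yogurt"), c("milk", "yogurt"), c("sugar")
)

# Support: share of baskets that contain every item in the itemset
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Confidence of lhs -> rhs: support of the combined itemset / support of lhs
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

print(support(c("bread", "butter"), baskets))
print(confidence("butter", "bread", baskets))
```

With these baskets the support of bread & butter comes out at 10% and the confidence of butter -> bread at 50%, the same figures as in the text.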
An Association Rule is the relationship between items in an itemset. Above, the association rule is butter -> bread, but you could also have bread -> butter for the same itemset, or rules for other itemsets like sugar -> milk.
The -> can be read as “to”.
Antecedent item is the item to the left of the arrow in a confidence calculation. If we determine the confidence of butter -> bread is 50% then butter is the antecedent item. It’s also referred to as the Left Hand Side (LHS).
Consequent item is the item to the right of the arrow in a confidence calculation. If we determine the confidence of butter -> bread is 50% then bread is the consequent item. It’s also referred to as the Right Hand Side (RHS).
We’re starting to answer the question of IF antecedent item THEN consequent item.
Lift is the strength of an association rule. It’s measured as ‘Confidence’/‘Support of the consequent item’. That could be rewritten as:
(“itemset frequency” / “antecedent item frequency”) / “consequent item frequency”
So if the confidence of butter -> bread is 50% and bread is in 40% of transactions, then the lift of butter -> bread is 50% / 40%, or 1.25.
This means that the presence of butter makes bread 25% more likely to be purchased than random chance would suggest.
If lift is greater than 1, they are positively associated.
If lift is 1, there is no association.
If lift is less than 1, they are negatively associated.
butter -> bread and bread -> butter are symmetric, meaning their lift will be the same.
Lift of butter -> bread is 10% / 20% / 40%, or 1.25, and lift of bread -> butter is 10% / 40% / 20%, or 1.25.
It doesn’t matter in which order you divide the itemset frequency by the two items’ frequencies!
Another way to understand the symmetry: lift is the joint probability of the items occurring together divided by the product of their individual probabilities. So lift will be the same whether you go from A -> B or B -> A.
\[ \text{Lift} = \frac{P(A \cap B)}{P(A) \cdot P(B)} \]
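The symmetry is easy to confirm numerically. Here is a small sketch that plugs in the frequencies from the bread and butter example (butter 20%, bread 40%, both together 10%); the variable names are mine and only illustrative.

```r
# Frequencies from the running example (shares of all transactions)
p_butter <- 0.20
p_bread <- 0.40
p_both <- 0.10

# Lift written as confidence / support of the consequent, in both directions
lift_butter_to_bread <- (p_both / p_butter) / p_bread
lift_bread_to_butter <- (p_both / p_bread) / p_butter

# Lift written as joint probability / product of the individual probabilities
lift_joint <- p_both / (p_butter * p_bread)

print(c(lift_butter_to_bread, lift_bread_to_butter, lift_joint))
```

All three expressions come to 1.25 (up to floating-point rounding), which is why the direction of the rule changes the confidence but not the lift.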
So we do need to convert our data, but it doesn’t need to be a binary matrix where each column is a unique item and each cell is a 0 or 1. Instead we’re making a structured list of itemsets.
Here’s the first five in the structured list of itemsets we created:
# Create structured list of itemsets
transactions_list <- apply(df, 1, function(row) row[!is.na(row)])
transactions <- as(transactions_list, "transactions")
# display first five itemsets
inspect(transactions[1:5])
## items
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread}
## [2] {coffee,
## tropical fruit,
## yogurt}
## [3] {whole milk}
## [4] {cream cheese,
## meat spreads,
## pip fruit,
## yogurt}
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk}
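One thing to watch with `apply()`: it returns a list here only because the rows hold different numbers of items. If every row happened to keep the same number of non-missing items, base R would simplify the result to a matrix instead. Below is a minimal sketch of a more defensive variant on a made-up two-row data frame; `toy` and `row_items` are illustrative names, not part of the analysis above.

```r
# Made-up data where both receipts hold exactly two items
toy <- data.frame(
  X1 = c("milk", "bread"),
  X2 = c("yogurt", "butter"),
  X3 = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

# Both rows keep two items, so apply() simplifies the result to a matrix
simplified <- apply(toy, 1, function(row) row[!is.na(row)])
print(is.matrix(simplified))

# Building the list row by row always yields one character vector per receipt
row_items <- lapply(seq_len(nrow(toy)), function(i) {
  row <- unlist(toy[i, ], use.names = FALSE)
  row[!is.na(row)]
})
print(row_items)
```

A list shaped like `row_items` can be coerced with `as(..., "transactions")` in the same way as `transactions_list`.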
Here we mine the data for association rules and then review the summary of the rules for any early-stage insights.
Here we look for rules where the itemset appears in at least 0.1% of all transactions and the confidence, or how often the rule is true, is at least 70%. The maximum rule length is left at its default of 10 items (maxlen in the output below); in this run the algorithm checked subsets of the itemsets up to a size of six.
# Mine association rules
rules <- apriori(transactions, parameter = list(supp = 0.001, conf = 0.7))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [1185 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
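It helps to translate the relative support threshold into a number of receipts. A quick sketch, using only the transaction count and `supp` value shown in the log above (the variable names are mine):

```r
n_transactions <- 9835   # from the apriori log
min_support <- 0.001     # the supp argument

# Smallest whole number of receipts whose share reaches the threshold
min_count <- ceiling(min_support * n_transactions)
print(min_count)

# Support of an itemset that appears in exactly that many receipts
print(min_count / n_transactions)

# One receipt fewer falls below the threshold
print((min_count - 1) / n_transactions)
```

So an itemset has to show up on at least 10 receipts to qualify. The log reports an absolute minimum support count of 9, which appears to be the same product rounded down; an itemset seen on only 9 receipts would have a support of about 0.0009 and would not pass the 0.001 threshold.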
We have mined 1,185 rules.
There is only one rule with two items, counting both the left-hand side and the right-hand side. Most rules (692) have four items, and 24 rules have six items.
The support, or frequency of the itemset or subset, ranges from 0.1017% to 0.5694%.
The confidence, or how often the rule is true, ranges from 70% to 100%.
The lift, or how the presence of the antecedent item(s) influences the likelihood of the consequent item(s) also being present, ranges from 2.74 to 11.26. Put another way, the consequent is between 174% and 1026% more likely than chance.
# Show summary of association rules
summary(rules)
## set of 1185 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 1 126 692 342 24
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 4.000 4.221 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.7000 Min. :0.001017 Min. : 2.740
## 1st Qu.:0.001017 1st Qu.:0.7222 1st Qu.:0.001423 1st Qu.: 2.935
## Median :0.001220 Median :0.7647 Median :0.001525 Median : 3.355
## Mean :0.001347 Mean :0.7791 Mean :0.001750 Mean : 3.634
## 3rd Qu.:0.001423 3rd Qu.:0.8235 3rd Qu.:0.001932 3rd Qu.: 3.976
## Max. :0.005694 Max. :1.0000 Max. :0.008134 Max. :11.264
## count
## Min. :10.00
## 1st Qu.:10.00
## Median :12.00
## Mean :13.25
## 3rd Qu.:14.00
## Max. :56.00
##
## mining info:
## data ntransactions support confidence
## transactions 9835 0.001 0.7
## call
## apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.7))
The number one rule by lift: if someone bought liquor and red/blush wine, they were 1026% more likely than chance to buy bottled beer. My takeaway from this is that stores with a license to sell liquor should sell wine, liquor and beer.
Support: The itemset liquor & red/blush wine & bottled beer appeared in 0.19% of transactions.
Confidence: The rule held 90.5% of the time. That is 0.19% divided by the percentage of transactions that contained liquor & red/blush wine (0.21%, the coverage column).
Lift: The lift is 11.26. So the presence of liquor & red/blush wine increases the likelihood of also buying bottled beer by 1026%.
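These three figures can be reproduced with plain arithmetic from the count, coverage and lift that `inspect()` reports for this rule. The sketch below only re-derives numbers already in the output; the support of bottled beer is not printed anywhere, so the last value is merely what the reported confidence and lift imply.

```r
n_transactions <- 9835
count_rule <- 19          # receipts with liquor, red/blush wine and bottled beer
coverage <- 0.002135231   # share of receipts with liquor and red/blush wine
lift <- 11.263713

support <- count_rule / n_transactions
confidence <- support / coverage

# Lift expressed as "percent more likely than chance"
pct_more_likely <- (lift - 1) * 100

# Support of bottled beer implied by confidence / lift (not reported directly)
implied_beer_support <- confidence / lift

print(round(c(support = support, confidence = confidence), 4))
print(round(pct_more_likely))
print(round(implied_beer_support, 4))
```

The `(lift - 1) * 100` step is where the 1026% comes from: a lift of 11.26 means 11.26 times the baseline rate, or 10.26 times more than the baseline.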
Here are the top ten rules by lift:
# Filter and sort by lift
top_rules <- sort(rules, by = "lift", decreasing = TRUE)
inspect(top_rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.263713 19
## [2] {citrus fruit,
## fruit/vegetable juice,
## other vegetables,
## soda} => {root vegetables} 0.001016777 0.9090909 0.001118454 8.340400 10
## [3] {oil,
## other vegetables,
## tropical fruit,
## whole milk,
## yogurt} => {root vegetables} 0.001016777 0.9090909 0.001118454 8.340400 10
## [4] {citrus fruit,
## fruit/vegetable juice,
## grapes} => {tropical fruit} 0.001118454 0.8461538 0.001321810 8.063879 11
## [5] {other vegetables,
## rice,
## whole milk,
## yogurt} => {root vegetables} 0.001321810 0.8666667 0.001525165 7.951182 13
## [6] {oil,
## other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.001321810 0.8666667 0.001525165 7.951182 13
## [7] {ham,
## other vegetables,
## pip fruit,
## yogurt} => {tropical fruit} 0.001016777 0.8333333 0.001220132 7.941699 10
## [8] {beef,
## citrus fruit,
## other vegetables,
## tropical fruit} => {root vegetables} 0.001016777 0.8333333 0.001220132 7.645367 10
## [9] {fruit/vegetable juice,
## grapes,
## other vegetables} => {tropical fruit} 0.001118454 0.7857143 0.001423488 7.487888 11
## [10] {butter,
## domestic eggs,
## other vegetables,
## whole milk,
## yogurt} => {tropical fruit} 0.001016777 0.7692308 0.001321810 7.330799 10
Here we display the top ten rules as a graph. Lift is shown as the redness of the connectors and support as the size of the connectors. Confidence should be the length or thickness of the arrows, but I’m not convinced it’s rendered in the graph.
Take the liquor, red/blush wine and bottled beer connector in the bottom left. liquor and red/blush wine have arrows to the connector so they are the LHS or antecedent items, and bottled beer has an arrow away from the connector so it is the RHS or consequent item. The red is roughly an 11, so if someone buys the antecedent items they are ~1000% more likely to also buy the consequent item, bottled beer.
Take tropical fruit: there are many arrows that lead to it. The one coming from the red dot to its left has arrows pointing to that red dot from citrus fruit, fruit/vegetable juice, and grapes. So if someone buys those three things they are ~700% more likely than random chance to also buy tropical fruit.
# Visualize rules
sink(tempfile()) # suppresses the text output
plot(top_rules[1:10], method = "graph", control = list(type = "items"))
sink() # restore normal console output
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
It may be that we should restrict the support to be higher than 0.1%, the confidence higher than 70%, or number of items considered in an association rule to fewer than six for better results. We will consider that for a future exercise.
It may also be interesting to try to surface rules with lower confidence. There could be interesting relationships overshadowed by association rules with higher confidence.
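As a rough sketch of what that follow-up could look like, the block below mines the same made-up baskets twice, once with stricter and once with looser settings. Everything here is an assumption for illustration: the toy baskets, the `mine()` helper, and the threshold values are hypothetical and have not been tuned for the Groceries data.

```r
library(arules)

# Made-up baskets, only to show how the thresholds change the rule count
baskets <- list(
  c("bread", "butter"), c("bread", "butter", "milk"), c("bread", "milk"),
  c("butter", "milk"), c("bread", "butter", "jam"), c("milk"),
  c("bread", "jam"), c("butter"), c("milk", "jam"), c("bread", "butter")
)
toy_transactions <- as(baskets, "transactions")

# Hypothetical helper: run apriori quietly with the given thresholds
mine <- function(supp, conf, maxlen) {
  apriori(
    toy_transactions,
    parameter = list(supp = supp, conf = conf, minlen = 2, maxlen = maxlen),
    control = list(verbose = FALSE)
  )
}

strict <- mine(supp = 0.3, conf = 0.6, maxlen = 2)
loose <- mine(supp = 0.1, conf = 0.3, maxlen = 3)

print(length(strict))
print(length(loose))

# Looser confidence surfaces more rules; sorting by lift helps triage them
inspect(head(sort(loose, by = "lift"), n = 3))
```

Lowering the thresholds can only add rules, so sorting or filtering by lift afterwards is one way to keep the larger rule set manageable.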
We don’t know how far apart the items were in the store that our transactions dataset came from. If we had multiple datasets paired with distance information then we could start to model where items should be placed to maximize profit, potentially constrained for rules about intuitive and orderly placement.
Maybe a consortium, like the (made up) National Consortium of Tomato Producers, could do a project where they look at all products with associations to tomatoes to get ideas for marketing campaigns. All they need is one more baked feta and tomato pasta recipe to go viral.
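Pulling out every rule that involves one product is a short filter in arules. This is a minimal sketch on made-up baskets; the item name "tomatoes" and the thresholds are hypothetical, and I have not checked what the equivalent item is called in the Groceries file.

```r
library(arules)

# Made-up baskets for illustration only
baskets <- list(
  c("tomatoes", "feta", "pasta"), c("tomatoes", "feta"),
  c("tomatoes", "pasta"), c("bread", "butter"),
  c("feta", "pasta", "tomatoes"), c("bread", "milk"),
  c("tomatoes", "basil"), c("milk")
)
toy_transactions <- as(baskets, "transactions")

toy_rules <- apriori(
  toy_transactions,
  parameter = list(supp = 0.2, conf = 0.5, minlen = 2),
  control = list(verbose = FALSE)
)

# Keep rules with the product of interest on either side of the arrow
tomato_rules <- subset(
  toy_rules,
  subset = lhs %in% "tomatoes" | rhs %in% "tomatoes"
)
inspect(sort(tomato_rules, by = "lift"))
```

Restricting the filter to `rhs %in% "tomatoes"` would instead answer the narrower question of what tends to lead to a tomato purchase.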