Introduction

Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. In other words, a receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased on it. Each line is called a transaction, and each filled column in a row holds one item.


Data

Here we’ve loaded the data.

I was expecting every column to be a unique item, with a 1 or 0 in each row depending on whether that item was purchased in that transaction.

Instead, for each transaction (row), the first n columns are filled, where n is the number of items purchased, and each value is the name of an item purchased. The remaining columns are NA.

By inspection, it looks like quantities aren’t recorded: whether someone bought one yogurt or 20, it’s listed in the data simply as yogurt.

# Check out packages
library(readr)
library(arules)
library(arulesViz)

# Import data
df <- read_csv("~/Documents/D624/HM10/GroceryDataSet.csv", 
    col_names = FALSE)

head(df)
## # A tibble: 6 × 32
##   X1     X2    X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13  
##   <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 citru… semi… marg… read… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 2 tropi… yogu… coff… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 3 whole… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 4 pip f… yogu… crea… meat… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 5 other… whol… cond… long… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 6 whole… butt… yogu… rice  abra… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## # ℹ 19 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>, X18 <chr>,
## #   X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>, X23 <lgl>, X24 <lgl>,
## #   X25 <lgl>, X26 <lgl>, X27 <lgl>, X28 <lgl>, X29 <lgl>, X30 <lgl>,
## #   X31 <lgl>, X32 <lgl>



Foundational Concepts

Here are some foundational concepts to understand market basket analysis.


Itemset

An itemset is a set of one or more items that appear together in a transaction. For example, if both bread and butter are purchased in a given transaction, then bread & butter is an itemset in that transaction. The full list of items in a transaction is also an itemset, so bread & butter is strictly a subset of it, but we’ll still use the term itemset.


Support

Support is the percentage of transactions in which the itemset appears. For example, if bread & butter appear together in 10 of 100 transactions, the support is 10%.
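To make the definition concrete, here is a minimal base R sketch that computes support on a handful of made-up receipts. The data and the helper name `itemset_support` are assumptions for illustration only; they are not part of the Groceries analysis.

```r
# Toy transactions (hypothetical data, not taken from the Groceries file)
toy_transactions <- list(
  c("bread", "butter"),
  c("bread", "milk"),
  c("butter", "yogurt"),
  c("bread", "butter", "milk"),
  c("milk")
)

# Share of transactions that contain every item in `itemset`
itemset_support <- function(transactions, itemset) {
  contains_all <- vapply(
    transactions,
    function(basket) all(itemset %in% basket),
    logical(1)
  )
  mean(contains_all)
}

# bread & butter appear together in 2 of the 5 toy receipts
support_bread_butter <- itemset_support(toy_transactions, c("bread", "butter"))
support_bread_butter
```

Note that a receipt counts toward the support as long as it contains all items in the itemset, whatever else is in the basket.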


Confidence

Confidence is how often the rule holds when its left-hand side is present. For example, if butter appears in 20% of transactions and the support of bread & butter is 10%, then the confidence of butter -> bread is 10/20 or 50%.
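The same arithmetic can be checked in base R. This sketch builds a hypothetical set of 100 receipts that matches the numbers in the example (butter in 20 receipts, bread in 40, both together in 10); the receipts and the helper `has_items` are invented for illustration.

```r
# Hypothetical 100-receipt data set built to match the example:
# butter in 20 receipts, bread in 40, both together in 10.
receipts <- c(
  rep(list(c("bread", "butter")), 10),
  rep(list("butter"), 10),
  rep(list("bread"), 30),
  rep(list("milk"), 50)
)

# TRUE when the basket contains every item in `items`
has_items <- function(basket, items) all(items %in% basket)

n_butter <- sum(vapply(receipts, has_items, logical(1), items = "butter"))
n_both   <- sum(vapply(receipts, has_items, logical(1), items = c("bread", "butter")))

# confidence(butter -> bread) = support(bread & butter) / support(butter)
confidence_butter_bread <- n_both / n_butter
confidence_butter_bread
```

Because both supports share the same denominator (the total number of receipts), dividing the raw counts gives the same result as dividing the percentages.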


Association Rule

An Association Rule is a directed relationship between items in an itemset. Above, the association rule is butter -> bread, but you could also have bread -> butter for the same itemset, or rules from other itemsets like sugar -> milk.

The -> can be read as “to”.


Antecedent and Consequent Items

The antecedent item is the item to the left of the arrow in a rule. If we determine the confidence of butter -> bread is 50% then butter is the antecedent item. It’s also referred to as the Left Hand Side (LHS).

The consequent item is the item to the right of the arrow in a rule. If we determine the confidence of butter -> bread is 50% then bread is the consequent item. It’s also referred to as the Right Hand Side (RHS).

We’re starting to answer the question of IF antecedent item THEN consequent item.


Lift

Lift is the strength of an association rule. It’s measured as ‘Confidence’ / ‘Support of the consequent item’. That could be rewritten as:

(“itemset frequency” / “antecedent item frequency”) / “consequent item frequency”

So if the confidence of butter -> bread is 50% and bread, the consequent item, is in 40% of transactions, then the lift of butter -> bread is 50% / 40% or 1.25.

This means that a basket containing butter is 25% more likely to contain bread than a basket picked at random.

If lift is greater than 1, they are positively associated.

If lift is 1, there is no association.

If lift is less than 1, they are negatively associated.
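The lift calculation and the three cases above can be sketched in a few lines of base R, using the bread and butter figures from the running example. The helper `describe_lift` is a hypothetical name introduced here for illustration.

```r
# Figures from the running example (shares of all transactions)
support_both   <- 0.10  # bread & butter together
support_butter <- 0.20  # antecedent item
support_bread  <- 0.40  # consequent item

# lift = confidence / support of the consequent item
confidence_butter_bread <- support_both / support_butter
lift_butter_bread <- confidence_butter_bread / support_bread

# Hypothetical helper that puts the three lift cases into words
describe_lift <- function(lift, tol = 1e-8) {
  if (abs(lift - 1) < tol) {
    "no association"
  } else if (lift > 1) {
    "positive association"
  } else {
    "negative association"
  }
}

lift_butter_bread
describe_lift(lift_butter_bread)
```

The small tolerance is there because lift is a floating-point ratio, so testing for exact equality with 1 would be fragile.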


Symmetry

Lift is Symmetric: butter -> bread and bread -> butter have the same lift, even though their confidences differ (50% and 25%).

Lift of butter -> bread is 10% / 20% / 40% or 1.25

and

lift of bread -> butter is 10% / 40% / 20% or 1.25

It doesn’t matter in which order you divide the itemset frequency by the two items’ frequencies!

Another way to understand the symmetry is that lift is the joint probability of the items occurring together divided by the product of their individual probabilities. So lift will be the same whether you go from A -> B or B -> A.

\[ \text{Lift} = \frac{P(A \cap B)}{P(A) \cdot P(B)} \]
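The formula follows from substituting the definition of confidence into the definition of lift. A short derivation, where \(P(\cdot)\) stands for support and \(\text{Conf}\) for confidence (notation assumed here to match the formula above):

\[ \text{Lift}(A \to B) = \frac{\text{Conf}(A \to B)}{P(B)} = \frac{P(A \cap B) / P(A)}{P(B)} = \frac{P(A \cap B)}{P(A) \cdot P(B)} \]

\[ \text{Lift}(B \to A) = \frac{\text{Conf}(B \to A)}{P(A)} = \frac{P(A \cap B) / P(B)}{P(A)} = \frac{P(A \cap B)}{P(A) \cdot P(B)} \]

Both directions reduce to the same expression, which is why the two lifts agree even when the two confidences do not.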




Data Transformation

So we do need to convert our data, but it doesn’t need to be a binary matrix where each column is a unique item and each cell is a 0 or 1. Instead we’re making a structured list of itemsets.
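For contrast, this is roughly what the binary layout would look like. The sketch uses three made-up baskets in base R rather than the real data, and the variable names are illustrative assumptions.

```r
# Three hypothetical baskets (not read from the Groceries file)
baskets <- list(
  c("citrus fruit", "margarine"),
  c("yogurt", "coffee"),
  c("whole milk")
)

# Every distinct item becomes one column
items <- sort(unique(unlist(baskets)))

# One row per transaction, one column per item, TRUE where the item was bought
incidence <- t(vapply(
  baskets,
  function(basket) items %in% basket,
  logical(length(items))
))
colnames(incidence) <- items
incidence
```

With 169 distinct items and baskets that mostly hold only a few of them, a matrix like this would be almost entirely FALSE, which is one reason a list of itemsets is a convenient starting point.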

Here are the first five itemsets in the structured list we created:

# Create structured list of itemsets
transactions_list <- apply(df, 1, function(row) row[!is.na(row)])
transactions <- as(transactions_list, "transactions")

# display first five itemsets
inspect(transactions[1:5])
##     items                      
## [1] {citrus fruit,             
##      margarine,                
##      ready soups,              
##      semi-finished bread}      
## [2] {coffee,                   
##      tropical fruit,           
##      yogurt}                   
## [3] {whole milk}               
## [4] {cream cheese,             
##      meat spreads,             
##      pip fruit,                
##      yogurt}                   
## [5] {condensed milk,           
##      long life bakery product, 
##      other vegetables,         
##      whole milk}



Mining Association Rules

Here we mine the data for association rules and then review the summary of the rules for any early stage insights.


Mine Rules

Here we look for rules where the itemset appears in at least 0.1% of all transactions and the confidence, or how often the rule is true, is at least 70%. The maximum rule length is left at its default of 10 items, and in this run the search stopped after checking subsets of size six.
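To show what the two thresholds do, here is a brute-force sketch in base R that tries every two-item rule on a few made-up receipts and keeps those that clear a minimum support and a minimum confidence. It is an illustration of the filtering idea only, not how apriori() works internally, and the data, thresholds and names are all hypothetical.

```r
# Hypothetical toy receipts (not the Groceries data)
toy_receipts <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("butter", "milk"),
  c("bread", "butter", "jam")
)

# Share of receipts that contain every item in `itemset`
support_of <- function(itemset) {
  mean(vapply(toy_receipts, function(basket) all(itemset %in% basket), logical(1)))
}

min_support    <- 0.5
min_confidence <- 0.7

items <- sort(unique(unlist(toy_receipts)))
pairs <- combn(items, 2, simplify = FALSE)

# Try both directions of every pair and keep rules that clear both thresholds
kept <- list()
for (pair in pairs) {
  pair_support <- support_of(pair)
  if (pair_support < min_support) next
  for (i in 1:2) {
    lhs <- pair[i]
    rhs <- pair[3 - i]
    confidence <- pair_support / support_of(lhs)
    if (confidence >= min_confidence) {
      kept[[length(kept) + 1]] <- data.frame(
        lhs = lhs, rhs = rhs, support = pair_support, confidence = confidence
      )
    }
  }
}
toy_rules <- do.call(rbind, kept)
toy_rules
```

On this toy data only bread -> butter and butter -> bread survive. Lowering either threshold lets more rules through, which is the trade-off behind the 0.1% and 70% settings used here.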

# Mine association rules
rules <- apriori(transactions, parameter = list(supp = 0.001, conf = 0.7))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.7    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [1185 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].


Rules Summary

We have mined 1,185 rules.

There is only one rule with two items, counting the items on both the left-hand side and the right-hand side. The majority of rules (692) have four items, and 24 rules have six items.

Support

The support, or frequency of the itemset or subset, ranges from 0.1017% to 0.5694%.

Confidence

The confidence, or how often the rule is true, ranges from 70% to 100%.

Lift

The lift, or how the presence of the antecedent item(s) influences the likelihood of the consequent item(s) also being present, ranges from 2.74 to 11.26. In other words, the consequent is between 174% and 1026% more likely than chance to be present.
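The percentages come from subtracting 1 from the lift and multiplying by 100. A small sketch of that conversion, with the minimum and maximum lift copied from the summary(rules) output (the variable names are illustrative):

```r
# Lift range reported by summary(rules)
lift_range <- c(min = 2.740, max = 11.264)

# A lift of L means "L times as likely as chance",
# which is (L - 1) * 100 percent more likely than chance
percent_more_likely <- (lift_range - 1) * 100
round(percent_more_likely)
```

So a lift of 11.264 can be read either as "about 11 times as likely" or as "about 1026% more likely"; the two statements describe the same number.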

# Show summary of association rules
summary(rules)
## set of 1185 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5   6 
##   1 126 692 342  24 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   4.000   4.221   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.7000   Min.   :0.001017   Min.   : 2.740  
##  1st Qu.:0.001017   1st Qu.:0.7222   1st Qu.:0.001423   1st Qu.: 2.935  
##  Median :0.001220   Median :0.7647   Median :0.001525   Median : 3.355  
##  Mean   :0.001347   Mean   :0.7791   Mean   :0.001750   Mean   : 3.634  
##  3rd Qu.:0.001423   3rd Qu.:0.8235   3rd Qu.:0.001932   3rd Qu.: 3.976  
##  Max.   :0.005694   Max.   :1.0000   Max.   :0.008134   Max.   :11.264  
##      count      
##  Min.   :10.00  
##  1st Qu.:10.00  
##  Median :12.00  
##  Mean   :13.25  
##  3rd Qu.:14.00  
##  Max.   :56.00  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          9835   0.001        0.7
##                                                                      call
##  apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.7))



Top 10 Rules by Lift

The number one rule by lift: if someone bought liquor and red/blush wine, they were 1026% more likely than chance to also buy bottled beer. My takeaway from this is that stores with a license to sell liquor should sell wine, liquor and beer.

Support: The itemset of liquor & red/blush wine & bottled beer appeared in 0.19% of transactions.

Confidence: The rule held 90.5% of the time. That is 0.19% divided by the 0.21% of transactions that contained liquor & red/blush wine (the coverage).

Lift: The lift is 11.26. So the presence of liquor & red/blush wine increases the likelihood of also buying bottled beer by 1026%.
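These three figures can be cross-checked against each other with a little arithmetic. The sketch below only reuses numbers printed in the apriori() and inspect() output; the variable names are illustrative, and the bottled beer support it derives is implied by the rule's confidence and lift rather than read from the data directly.

```r
n_transactions <- 9835         # from the apriori() output
rule_count     <- 19           # receipts with liquor, red/blush wine and bottled beer
coverage       <- 0.002135231  # share of receipts with liquor and red/blush wine
lift           <- 11.263713

# support = count / number of transactions
support <- rule_count / n_transactions

# confidence = support of the full itemset / coverage of the LHS
confidence <- support / coverage

# Number of receipts that contain the LHS, recovered from the coverage
lhs_count <- round(coverage * n_transactions)

# Share of receipts with bottled beer implied by lift = confidence / support(RHS)
implied_beer_support <- confidence / lift

c(support = support, confidence = confidence, implied_beer_support = implied_beer_support)
lhs_count
```

Read this way, 19 of the 21 receipts that contained both liquor and red/blush wine also contained bottled beer, which is a small sample to base a stocking decision on.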

Here are the top ten rules by lift:

# Filter and sort by lift
top_rules <- sort(rules, by = "lift", decreasing = TRUE)
inspect(top_rules[1:10])
##      lhs                         rhs                   support confidence    coverage      lift count
## [1]  {liquor,                                                                                        
##       red/blush wine}         => {bottled beer}    0.001931876  0.9047619 0.002135231 11.263713    19
## [2]  {citrus fruit,                                                                                  
##       fruit/vegetable juice,                                                                         
##       other vegetables,                                                                              
##       soda}                   => {root vegetables} 0.001016777  0.9090909 0.001118454  8.340400    10
## [3]  {oil,                                                                                           
##       other vegetables,                                                                              
##       tropical fruit,                                                                                
##       whole milk,                                                                                    
##       yogurt}                 => {root vegetables} 0.001016777  0.9090909 0.001118454  8.340400    10
## [4]  {citrus fruit,                                                                                  
##       fruit/vegetable juice,                                                                         
##       grapes}                 => {tropical fruit}  0.001118454  0.8461538 0.001321810  8.063879    11
## [5]  {other vegetables,                                                                              
##       rice,                                                                                          
##       whole milk,                                                                                    
##       yogurt}                 => {root vegetables} 0.001321810  0.8666667 0.001525165  7.951182    13
## [6]  {oil,                                                                                           
##       other vegetables,                                                                              
##       tropical fruit,                                                                                
##       whole milk}             => {root vegetables} 0.001321810  0.8666667 0.001525165  7.951182    13
## [7]  {ham,                                                                                           
##       other vegetables,                                                                              
##       pip fruit,                                                                                     
##       yogurt}                 => {tropical fruit}  0.001016777  0.8333333 0.001220132  7.941699    10
## [8]  {beef,                                                                                          
##       citrus fruit,                                                                                  
##       other vegetables,                                                                              
##       tropical fruit}         => {root vegetables} 0.001016777  0.8333333 0.001220132  7.645367    10
## [9]  {fruit/vegetable juice,                                                                         
##       grapes,                                                                                        
##       other vegetables}       => {tropical fruit}  0.001118454  0.7857143 0.001423488  7.487888    11
## [10] {butter,                                                                                        
##       domestic eggs,                                                                                 
##       other vegetables,                                                                              
##       whole milk,                                                                                    
##       yogurt}                 => {tropical fruit}  0.001016777  0.7692308 0.001321810  7.330799    10



Simple Cluster Analysis

Here we display the top ten rules as a graph. Lift is shown as the redness of the connectors (the dots that stand for rules) and support as their size. I expected confidence to be shown as the length or thickness of the arrows, but I’m not convinced it’s rendered in the graph.

Take the liquor, red/blush wine and bottled beer connector in the bottom left. Liquor and red/blush wine have arrows pointing to the connector, so they are the LHS or antecedent items, and bottled beer has an arrow pointing away from the connector, so it is the RHS or consequent item. The red corresponds to a lift of roughly 11, so if someone buys the antecedent items they are ~1000% more likely than random chance to also buy the consequent item, bottled beer.

Take tropical fruit: there are many arrows that lead to it. The one coming from the red dot to its left has arrows pointing to that red dot from citrus fruit, fruit/vegetable juice, and grapes. So if someone buys those three things they are ~700% more likely than random chance to also buy tropical fruit.

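# Visualize rules
sink(tempfile()) # divert text output to a temporary file
# Note: `type` is not among the control parameters listed in the output below
plot(top_rules[1:10], method = "graph", control = list(type = "items"))
sink() # restore normal output so later chunks print as usual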
## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE




Future Exploration

For better results, we may need to raise the minimum support above 0.1%, raise the minimum confidence above 70%, or limit the number of items in an association rule to fewer than six. We will consider that for a future exercise.

It may also be interesting to try to surface rules with lower confidence. There could be interesting relationships overshadowed by association rules with higher confidence.

We don’t know how far apart the items were in the store that our transactions dataset came from. If we had multiple datasets paired with distance information then we could start to model where items should be placed to maximize profit, potentially constrained for rules about intuitive and orderly placement.

Maybe a consortium, like the (made up) National Consortium of Tomato Producers, could do a project where they look at all products with associations to tomatoes to get ideas for marketing campaigns. All they need is one more feta and tomato bake pasta recipe to go viral.