Association Rules - Market Basket Analysis of Bakery Sales

1 Introduction

The objective of this paper is to apply Association Rules to analyse bakery basket sales data. The data set is sourced from Kaggle (https://www.kaggle.com/datasets/akashdeepkuila/bakery). It originates from “The Bread Basket”, a bakery situated in Edinburgh, and contains transaction details of customers who ordered items from this bakery between October 30, 2016, and April 9, 2017.

# Importing necessary libraries for whole project
library(arules)
library(arulesViz)
library(latex2exp)

The dataset comprises over 6,000 transactions. For the purpose of the analysis, I selected two of its columns:

TransactionNo: Unique identifier for every single transaction
Items: Items purchased

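# Keep only the transaction identifier and the purchased item
bakery <- read.csv("Bakery.csv")[, c("TransactionNo", "Items")]

# Write the two columns back out so that read.transactions() can parse them
write.csv(bakery, "filtered_bakery.csv", row.names = FALSE, quote = FALSE)

# format = "single": one (transaction, item) pair per line;
# cols gives the positions of the transaction ID and item columns.
# Note: header defaults to FALSE, so the CSV header line is read as data.
bakery_transactions <- read.transactions(
  "filtered_bakery.csv",
  format = "single",
  cols = c(1, 2),
  sep = ","
)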
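The intermediate CSV file is one route to a transactions object. As a minimal sketch, the same long-format table could probably be converted in memory instead. The toy data frame below is invented for illustration; only the column names are taken from the code above.

```r
library(arules)

# Hypothetical toy data in the same long format as Bakery.csv:
# one row per (transaction, item) pair.
toy <- data.frame(
  TransactionNo = c(1, 2, 2, 3, 3, 3),
  Items = c("Bread", "Scandinavian", "Scandinavian", "Coffee", "Cake", "Tea"),
  stringsAsFactors = FALSE
)

# Group items by transaction and drop repeated items within a basket.
baskets <- lapply(split(toy$Items, toy$TransactionNo), unique)

# Coerce the named list to an arules transactions object.
toy_transactions <- as(baskets, "transactions")

inspect(toy_transactions)
```

Applied to the real data, `toy` would be replaced by the `bakery` data frame.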

The following example illustrates the data.

inspect(head(bakery_transactions, 20))

##      items                                  transactionID
## [1]  {Bread}                                1            
## [2]  {Medialuna, Scandinavian}              10           
## [3]  {Chimichurri Oil, Scandinavian}        1000         
## [4]  {Bread, Truffles}                      1001         
## [5]  {Brownie, Focaccia}                    1002         
## [6]  {Bread, Coffee}                        1003         
## [7]  {Art Tray, Coffee, Cookies, Tea}       1004         
## [8]  {Coffee}                               1005         
## [9]  {Bread}                                1006         
## [10] {Alfajores, Coffee, Coke}              1007         
## [11] {Bread}                                1008         
## [12] {Bread}                                1009         
## [13] {Bread, Coffee, Medialuna, Pastry}     1010         
## [14] {Coffee}                               1011         
## [15] {Bread, Keeping It Local}              1012         
## [16] {Bread}                                1013         
## [17] {Coffee, Scandinavian}                 1014         
## [18] {Bread, Farm House, Medialuna, Pastry} 1015         
## [19] {Medialuna}                            1016         
## [20] {Coffee}                               1017

The following plot provides a visual representation of the frequency with which specific items appear in transactions.

itemFrequencyPlot(bakery_transactions, topN=25, type="relative", main="Item Frequency")

It is evident that, although the data come from a bakery, coffee is the most frequently purchased product. Bread is the second most popular product, and tea, cake, pastries and sandwiches are also popular. Judging by this plot, most association rules are likely to involve coffee or bread; however, item frequencies alone do not show which products are bought together.
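The frequencies behind such a plot can also be read as numbers. The sketch below uses a small invented set of baskets so that it runs on its own; `itemFrequency()` is the arules function that returns the values the plot draws.

```r
library(arules)

# Hypothetical toy baskets, used only to keep the example self-contained.
toy_transactions <- as(
  list(c("Bread", "Coffee"), c("Coffee"), c("Coffee", "Tea"), c("Bread")),
  "transactions"
)

# Relative frequency (support) of each single item, most frequent first.
freq <- sort(itemFrequency(toy_transactions, type = "relative"), decreasing = TRUE)
print(freq)
```

With the bakery data, the same call on `bakery_transactions` would list the share of transactions containing each product.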

2 Rules theory

Association rule learning is a machine learning technique based on rules that identifies meaningful relationships between variables in extensive databases. Its purpose is to uncover strong rules within datasets by applying specific measures of significance. In transactions involving various items, association rules aim to reveal the patterns or reasons behind the connections between certain items.

2.1 Support

Support is defined as the proportion of transactions in the dataset that contain both the antecedent (if part) and the consequent (then part) of a rule. It is a metric that indicates how frequently the rule occurs in the dataset. High support signifies that the rule is common and may be relevant.

\[ Support(A \Rightarrow B) = \frac{\text{Transactions containing A and B}}{\text{Total number of transactions}} \]
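As a worked example using figures reported later in this paper, the rule {Cake} => {Coffee} is found in 374 of the 6,577 transactions:

\[ Support(\text{Cake} \Rightarrow \text{Coffee}) = \frac{374}{6577} \approx 0.0569 \]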

2.2 Confidence

Confidence is the probability that the consequent occurs in transactions where the antecedent is present. It measures the reliability of the rule. A high confidence value suggests a strong relationship between the antecedent and the consequent.

\[ Confidence(A \Rightarrow B) = \frac{\text{Transactions containing A and B}}{\text{Transactions containing A}} \]
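As a worked example using figures reported later in this paper, cake appears in 694 transactions (the count of the rule {} => {Cake} in Section 4.1.3), and 374 of them also contain coffee:

\[ Confidence(\text{Cake} \Rightarrow \text{Coffee}) = \frac{374}{694} \approx 0.5389 \]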

2.3 Lift

Lift is the ratio of the observed frequency of the rule to the frequency expected if the antecedent and consequent were independent. It indicates how much more likely the consequent is to occur when the antecedent is present than it is across all transactions. A lift greater than 1 suggests a positive association, a lift of 1 indicates independence, and a lift below 1 suggests a negative association.

\[ Lift(A \Rightarrow B) = \frac{Confidence(A \Rightarrow B)}{Support(B)} \]
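The three measures can be checked by hand from raw counts. The sketch below defines a hypothetical helper, `rule_metrics()`, and applies it to {Toast} => {Coffee}. The joint count (170) and the total (6,577) are reported in the output of Section 3; the counts of toast baskets (233) and coffee baskets (3,188) are assumptions back-calculated from the reported confidence and lift.

```r
# Support, confidence and lift of a rule A => B from raw transaction counts.
rule_metrics <- function(n_ab, n_a, n_b, n_total) {
  support    <- n_ab / n_total               # share of baskets with A and B
  confidence <- n_ab / n_a                   # share of A-baskets that contain B
  lift       <- confidence / (n_b / n_total) # confidence relative to B's base rate
  c(support = support, confidence = confidence, lift = lift)
}

# {Toast} => {Coffee}: 170 joint baskets out of 6577 (reported in the output).
# 233 toast baskets and 3188 coffee baskets are back-calculated assumptions.
toast_coffee <- rule_metrics(n_ab = 170, n_a = 233, n_b = 3188, n_total = 6577)
print(round(toast_coffee, 6))
```

The result should agree with the support, confidence and lift that `apriori()` reports for this rule.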

3 Association Rules Implementation

The analysis begins by generating the rules with the Apriori algorithm. With the default values of support and confidence (0.1 and 0.8, respectively), the algorithm was unable to identify any rules. Consequently, these thresholds were reduced to 0.005 and 0.5, respectively.

rules <- apriori(bakery_transactions, parameter = list(supp = 0.005, conf = 0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 32 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[103 item(s), 6577 transaction(s)] done [0.00s].
## sorting and recoding items ... [37 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

A total of 17 rules were identified, and these will be subjected to further analysis. However, it is important to note that the initial parameters were significantly reduced.

Initially, the decision was taken to present the top ten rules with the highest level of support.

rules.by.supp <- sort(rules, by = "support", decreasing = TRUE)

rules.by.supp@quality$coverage <- NULL

inspect(head(rules.by.supp, 10))

##      lhs                rhs      support    confidence lift     count
## [1]  {Cake}          => {Coffee} 0.05686483 0.5389049  1.111787 374  
## [2]  {Pastry}        => {Coffee} 0.04895849 0.5590278  1.153302 322  
## [3]  {Sandwich}      => {Coffee} 0.04257260 0.5679513  1.171711 280  
## [4]  {Medialuna}     => {Coffee} 0.03314581 0.5751979  1.186661 218  
## [5]  {Cookies}       => {Coffee} 0.02995287 0.5267380  1.086686 197  
## [6]  {Hot chocolate} => {Coffee} 0.02736810 0.5263158  1.085815 180  
## [7]  {Toast}         => {Coffee} 0.02584765 0.7296137  1.505229 170  
## [8]  {Alfajores}     => {Coffee} 0.02250266 0.5522388  1.139296 148  
## [9]  {Juice}         => {Coffee} 0.02143835 0.5300752  1.093571 141  
## [10] {Scone}         => {Coffee} 0.01854949 0.5422222  1.118631 122

The results indicate that cake with coffee is the most frequent combination, with a support of 5.69%. It is followed by pastry with coffee (4.90%) and sandwich with coffee (4.26%). All ten rules consist of two products, one of which is coffee. It is noteworthy that bread, despite being the second most popular product, does not appear in any of these rules.

In the next step, the rules are sorted by confidence and the ten most reliable ones are presented.

rules.by.conf <- sort(rules, by = "confidence", decreasing = TRUE)

rules.by.conf@quality$coverage <- NULL

rules.by.conf@quality$support <- round(rules.by.conf@quality$support, 6)

rules.by.conf@quality$confidence <- round(rules.by.conf@quality$confidence, 6)

rules.by.conf@quality$lift <- round(rules.by.conf@quality$lift, 6)

inspect(head(rules.by.conf,10))

##      lhs                      rhs      support  confidence lift     count
## [1]  {Cake, Sandwich}      => {Coffee} 0.005626 0.755102   1.557812  37  
## [2]  {Toast}               => {Coffee} 0.025848 0.729614   1.505229 170  
## [3]  {Spanish Brunch}      => {Coffee} 0.014140 0.632653   1.305194  93  
## [4]  {Cake, Hot chocolate} => {Coffee} 0.006538 0.632353   1.304575  43  
## [5]  {Salad}               => {Coffee} 0.007906 0.611765   1.262101  52  
## [6]  {Medialuna}           => {Coffee} 0.033146 0.575198   1.186661 218  
## [7]  {Sandwich}            => {Coffee} 0.042573 0.567951   1.171711 280  
## [8]  {Pastry}              => {Coffee} 0.048958 0.559028   1.153302 322  
## [9]  {Alfajores}           => {Coffee} 0.022503 0.552239   1.139296 148  
## [10] {Tiffin}              => {Coffee} 0.010643 0.546875   1.128230  70

The results show that every rule has coffee as its consequent. The rule with the highest confidence is {Cake, Sandwich} => {Coffee}, at 75.51%: coffee was bought in over 75% of the transactions containing both cake and a sandwich. However, this rule rests on only 37 transactions. By contrast, toast with coffee combines high confidence with a much larger number of transactions: the pair was observed 170 times, which is 72.96% of all transactions involving toast. The pairings of sandwich with coffee and pastry with coffee are also worth noting because they occur frequently (280 and 322 transactions). For both products, the probability of being accompanied by coffee is between 56% and 57%.
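The sort, round and inspect steps used for these tables are the same each time, so they could be wrapped in a small helper. The sketch below is one possible version; `top_rules()` is a hypothetical name, and the `Groceries` data shipped with arules stand in for the bakery data so that the example runs on its own.

```r
library(arules)
data("Groceries")  # example data shipped with arules, not the bakery data

# Hypothetical helper: sort rules by one measure, drop the coverage column,
# round the main quality measures and keep the first n rules.
top_rules <- function(rules, by = "confidence", n = 10, digits = 6) {
  sorted <- sort(rules, by = by, decreasing = TRUE)
  q <- quality(sorted)
  q$coverage <- NULL
  measures <- c("support", "confidence", "lift")
  q[measures] <- round(q[measures], digits)
  quality(sorted) <- q
  head(sorted, n)
}

grocery_rules <- apriori(
  Groceries,
  parameter = list(supp = 0.01, conf = 0.5),
  control = list(verbose = FALSE)
)
top5 <- top_rules(grocery_rules, by = "lift", n = 5)
inspect(top5)
```

For the bakery rules, a call such as `inspect(top_rules(rules, by = "support"))` would play the role of the repeated code.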

In the final stage of the tabular analysis, it was decided that the rules should be presented in a sorted list according to lift.

rules.by.lift <- sort(rules, by = "lift", decreasing = TRUE)

rules.by.lift@quality$coverage <- NULL

rules.by.lift@quality$support <- round(rules.by.lift@quality$support, 6)

rules.by.lift@quality$confidence <- round(rules.by.lift@quality$confidence, 6)

rules.by.lift@quality$lift <- round(rules.by.lift@quality$lift, 6)

inspect(head(rules.by.lift,10 ))

##      lhs                      rhs      support  confidence lift     count
## [1]  {Cake, Sandwich}      => {Coffee} 0.005626 0.755102   1.557812  37  
## [2]  {Toast}               => {Coffee} 0.025848 0.729614   1.505229 170  
## [3]  {Spanish Brunch}      => {Coffee} 0.014140 0.632653   1.305194  93  
## [4]  {Cake, Hot chocolate} => {Coffee} 0.006538 0.632353   1.304575  43  
## [5]  {Salad}               => {Coffee} 0.007906 0.611765   1.262101  52  
## [6]  {Medialuna}           => {Coffee} 0.033146 0.575198   1.186661 218  
## [7]  {Sandwich}            => {Coffee} 0.042573 0.567951   1.171711 280  
## [8]  {Pastry}              => {Coffee} 0.048958 0.559028   1.153302 322  
## [9]  {Alfajores}           => {Coffee} 0.022503 0.552239   1.139296 148  
## [10] {Tiffin}              => {Coffee} 0.010643 0.546875   1.128230  70

The results indicate that transactions containing both cake and a sandwich are about 56% more likely to include coffee than the average transaction (lift 1.56). This relatively strong association points to potential for combination deals or advertisements. The rule for toast with coffee, with a lift of 1.51, similarly suggests that toast is commonly purchased with coffee, reinforcing its role as a complementary product. Lift values of around 1.3 are found for Spanish brunch with coffee, cake and hot chocolate with coffee, and salad with coffee. As in the previous tables, coffee is the consequent of every rule. Because the consequent is always the same, lift is simply confidence divided by the constant support of coffee, which is why this ranking is identical to the ranking by confidence.
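To make the comparison concrete, the overall share of coffee baskets can be back-calculated from the first row of the table as 0.755102 / 1.557812, which is approximately 0.4847. This is a derived figure; it is not printed in the output above. The lift of the first rule is then:

\[ Lift(\{\text{Cake}, \text{Sandwich}\} \Rightarrow \text{Coffee}) = \frac{0.7551}{0.4847} \approx 1.558 \]

In other words, coffee is found in about 75.5% of baskets containing cake and a sandwich, against about 48.5% of all baskets.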

4 Visualisations

The first visualisation represents the rules as a network, with lines signifying relationships. The colour of each dot indicates the strength of the rule in terms of lift, and its size indicates the strength in terms of support.

plot(rules, method="graph")

The graph shows that coffee is the focal point of this establishment: every rule points to coffee, which confirms the results of the tables above.

The parallel coordinates plot below offers another confirmation of what the rules look like. All of them lead to coffee, and the plot makes clear that only a few rules have a two-item antecedent, such as cake with hot chocolate or cake with sandwich.

plot(rules, method="paracoord")

4.1 Individual rule representation

In the general analysis above, every rule had coffee as its consequent. This section therefore focuses on bread, toast, cake and pastry as consequents. For each product, the support and confidence parameters were adjusted so that approximately 10 rules are displayed.
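The four subsections below repeat the same call with a different consequent and support threshold. A possible wrapper is sketched here; `rules_for()` is a hypothetical name, and the `Groceries` data shipped with arules stand in for the bakery data. Unlike the calls in this paper, the sketch sets `minlen = 2`, an assumption that suppresses rules with an empty antecedent.

```r
library(arules)
data("Groceries")  # example data shipped with arules, not the bakery data

# Hypothetical helper: mine rules whose consequent is a single chosen item.
rules_for <- function(transactions, item, supp, conf = 0.05) {
  apriori(
    data = transactions,
    # minlen = 2 excludes rules of the form {} => {item}
    parameter = list(supp = supp, conf = conf, minlen = 2, target = "rules"),
    appearance = list(default = "lhs", rhs = item),
    control = list(verbose = FALSE)
  )
}

milk_rules <- rules_for(Groceries, item = "whole milk", supp = 0.02)
inspect(head(sort(milk_rules, by = "confidence"), 5))
```

For the bakery data, a call such as `rules_for(bakery_transactions, "Toast", supp = 0.001)` would correspond to the toast analysis, minus the empty-antecedent rule.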

4.1.1 Bread

rules.Bread <- apriori(
  data = bakery_transactions,
  parameter = list(supp = 0.01, conf = 0.05, target = "rules"),
  appearance = list(default = "lhs", rhs = "Bread"),
  control = list(verbose = FALSE)
)
rules.Bread.byconf <- sort(rules.Bread, by = "confidence", decreasing = TRUE)

rules.Bread.byconf@quality$support <- round(rules.Bread.byconf@quality$support, 6)

rules.Bread.byconf@quality$confidence <- round(rules.Bread.byconf@quality$confidence, 6)

rules.Bread.byconf@quality$lift <- round(rules.Bread.byconf@quality$lift, 6)

rules.Bread.byconf@quality$coverage <- NULL

inspect(rules.Bread.byconf)

##      lhs                 rhs     support  confidence lift     count
## [1]  {Pastry}         => {Bread} 0.029801 0.340278   1.042874  196 
## [2]  {}               => {Bread} 0.326289 0.326289   1.000000 2146 
## [3]  {Medialuna}      => {Bread} 0.016421 0.284960   0.873339  108 
## [4]  {Alfajores}      => {Bread} 0.011251 0.276119   0.846243   74 
## [5]  {Brownie}        => {Bread} 0.011860 0.268966   0.824318   78 
## [6]  {Cookies}        => {Bread} 0.015205 0.267380   0.819458  100 
## [7]  {Coffee, Pastry} => {Bread} 0.011403 0.232919   0.713844   75 
## [8]  {Hot chocolate}  => {Bread} 0.012012 0.230994   0.707944   79 
## [9]  {Sandwich}       => {Bread} 0.017029 0.227181   0.696256  112 
## [10] {Cake}           => {Bread} 0.023415 0.221902   0.680079  154 
## [11] {Tea}            => {Bread} 0.029649 0.207226   0.635101  195 
## [12] {Coffee}         => {Bread} 0.090315 0.186324   0.571040  594

Looking at the rules for bread, the only one with a lift above 1 is the rule involving pastry, and only by a negligible margin (1.04). Every other antecedent has a lift below 1, meaning that customers who buy those products are less likely than average to buy bread as well. The rule with the empty antecedent shows that bread appears in over 32% of all transactions (2,146 baskets); it describes the overall frequency of bread and does not, by itself, mean that those baskets contain nothing else. This is noteworthy given that bread is the second most commonly purchased product, yet it does not appear in any of the general rules above. Consequently, for this particular bakery, bundles or combination offers featuring bread appear unlikely to be effective.
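The difference between "baskets that contain bread" and "baskets that contain only bread" can be checked directly on a transactions object. The sketch below uses invented baskets; `size()` and `%in%` are arules functions for basket length and item membership.

```r
library(arules)

# Hypothetical toy baskets, not the bakery data.
toy <- as(
  list("Bread", c("Bread", "Coffee"), "Coffee", "Bread", c("Cake", "Coffee")),
  "transactions"
)

# Share of baskets that contain bread at all: what the rule {} => {Bread} reports.
share_with_bread <- itemFrequency(toy)[["Bread"]]

# Share of baskets that contain bread and nothing else.
bread_only <- size(toy) == 1 & toy %in% "Bread"
share_bread_only <- mean(bread_only)

print(c(with_bread = share_with_bread, bread_only = share_bread_only))
```

Running the last two steps on `bakery_transactions` would show how many of the bread baskets are in fact single-item purchases.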

4.1.2 Toast

rules.Toast <- apriori(
  data = bakery_transactions,
  parameter = list(supp = 0.001, conf = 0.05, target = "rules"),
  appearance = list(default = "lhs", rhs = "Toast"),
  control = list(verbose = FALSE)
)
rules.Toast.byconf <- sort(rules.Toast, by = "confidence", decreasing = TRUE)

rules.Toast.byconf@quality$support <- round(rules.Toast.byconf@quality$support, 6)

rules.Toast.byconf@quality$confidence <- round(rules.Toast.byconf@quality$confidence, 6)

rules.Toast.byconf@quality$lift <- round(rules.Toast.byconf@quality$lift, 6)

rules.Toast.byconf@quality$coverage <- NULL

inspect(rules.Toast.byconf)

##     lhs                        rhs     support  confidence lift     count
## [1] {Bread, Hot chocolate}  => {Toast} 0.001064 0.088608   2.501168   7  
## [2] {Coffee, Juice}         => {Toast} 0.001672 0.078014   2.202143  11  
## [3] {Coffee, Hot chocolate} => {Toast} 0.001825 0.066667   1.881831  12  
## [4] {Hot chocolate}         => {Toast} 0.003193 0.061404   1.733266  21  
## [5] {Juice}                 => {Toast} 0.002433 0.060150   1.697893  16  
## [6] {Coffee, Tea}           => {Toast} 0.003041 0.058480   1.650729  20  
## [7] {Spanish Brunch}        => {Toast} 0.001216 0.054422   1.536189   8  
## [8] {Coffee}                => {Toast} 0.025848 0.053325   1.505229 170  
## [9] {Bread, Coffee}         => {Toast} 0.004561 0.050505   1.425630  30

The situation for toast differs markedly from that for bread. Although support is low for every rule, the lift values in the table above are high. For example, when a transaction contains bread and hot chocolate, the probability that it also contains toast is about 8.9%, roughly 2.5 times the rate at which toast appears overall. The confidence values themselves are low in absolute terms (5% to 9%), and several rules rest on fewer than 20 transactions, so these results should be treated with caution. Even so, the high lift values suggest that bundles including toast could attract customers.
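Because high lift can rest on very few baskets, it may help to keep only rules above a minimum transaction count before drawing conclusions. The sketch below illustrates the idea on the `Groceries` data shipped with arules; the threshold of 100 is an arbitrary assumption chosen for the example.

```r
library(arules)
data("Groceries")  # example data shipped with arules, not the bakery data

grocery_rules <- apriori(
  Groceries,
  parameter = list(supp = 0.005, conf = 0.3),
  control = list(verbose = FALSE)
)

# Keep rules seen in at least 100 baskets that also have lift above 1.
q <- quality(grocery_rules)
robust <- grocery_rules[q$count >= 100 & q$lift > 1]

inspect(head(sort(robust, by = "lift"), 5))
```

For the toast rules, a much lower threshold (for example 20 transactions) would be needed, since the largest multi-item rule there has a count of 30.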

plot(rules.Toast, method="paracoord")

The parallel coordinates plot shows that five of the nine rules have a two-item antecedent and therefore involve three items in total. This is particularly relevant when designing promotional bundles.

4.1.3 Cake

rules.Cake <- apriori(
  data = bakery_transactions,
  parameter = list(supp = 0.005, conf = 0.05, target = "rules"),
  appearance = list(default = "lhs", rhs = "Cake"),
  control = list(verbose = FALSE)
)
rules.Cake.byconf <- sort(rules.Cake, by = "confidence", decreasing = TRUE)

rules.Cake.byconf@quality$support <- round(rules.Cake.byconf@quality$support, 6)

rules.Cake.byconf@quality$confidence <- round(rules.Cake.byconf@quality$confidence, 6)

rules.Cake.byconf@quality$lift <- round(rules.Cake.byconf@quality$lift, 6)

rules.Cake.byconf@quality$coverage <- NULL

inspect(rules.Cake.byconf)

##      lhs                        rhs    support  confidence lift     count
## [1]  {Coffee, Hot chocolate} => {Cake} 0.006538 0.238889   2.263937  43  
## [2]  {Coffee, Tea}           => {Cake} 0.011251 0.216374   2.050567  74  
## [3]  {Hot chocolate}         => {Cake} 0.010339 0.198830   1.884305  68  
## [4]  {Tea}                   => {Cake} 0.026304 0.183847   1.742308 173  
## [5]  {Bread, Tea}            => {Cake} 0.005170 0.174359   1.652390  34  
## [6]  {Juice}                 => {Cake} 0.006994 0.172932   1.638870  46  
## [7]  {Soup}                  => {Cake} 0.005170 0.146552   1.388863  34  
## [8]  {Cookies}               => {Cake} 0.007602 0.133690   1.266971  50  
## [9]  {Coffee, Sandwich}      => {Cake} 0.005626 0.132143   1.252311  37  
## [10] {Coffee}                => {Cake} 0.056865 0.117315   1.111787 374  
## [11] {Bread, Coffee}         => {Cake} 0.009731 0.107744   1.021085  64  
## [12] {}                      => {Cake} 0.105519 0.105519   1.000000 694  
## [13] {Sandwich}              => {Cake} 0.007450 0.099391   0.941928  49  
## [14] {Bread}                 => {Cake} 0.023415 0.071761   0.680079 154

The situation for cake is similar to that for toast. However, the support values are somewhat higher, and pairing tea with cake looks like a promising special offer (support 2.6%, lift 1.74). The empty-antecedent rule shows that cake appears in 694 transactions, a support of just over 10%; this is the overall frequency of cake rather than the number of cakes bought on their own.

plot(rules.Cake, method="paracoord")

The parallel coordinates plot resembles the one for toast: rules with three items (a two-item antecedent plus cake) are again visible.

4.1.4 Pastry

rules.Pastry <- apriori(
  data = bakery_transactions,
  parameter = list(supp = 0.003, conf = 0.05, target = "rules"),
  appearance = list(default = "lhs", rhs = "Pastry"),
  control = list(verbose = FALSE)
)
rules.Pastry.byconf <- sort(rules.Pastry, by = "confidence", decreasing = TRUE)

rules.Pastry.byconf@quality$support <- round(rules.Pastry.byconf@quality$support, 6)

rules.Pastry.byconf@quality$confidence <- round(rules.Pastry.byconf@quality$confidence, 6)

rules.Pastry.byconf@quality$lift <- round(rules.Pastry.byconf@quality$lift, 6)

rules.Pastry.byconf@quality$coverage <- NULL

inspect(rules.Pastry.byconf)

##      lhs                        rhs      support  confidence lift     count
## [1]  {Medialuna}             => {Pastry} 0.008058 0.139842   1.596769  53  
## [2]  {Bread, Coffee}         => {Pastry} 0.011403 0.126263   1.441718  75  
## [3]  {Coffee, Hot chocolate} => {Pastry} 0.003345 0.122222   1.395583  22  
## [4]  {Coffee, Medialuna}     => {Pastry} 0.003953 0.119266   1.361828  26  
## [5]  {Coffee}                => {Pastry} 0.048958 0.101004   1.153302 322  
## [6]  {Hot chocolate}         => {Pastry} 0.004865 0.093567   1.068389  32  
## [7]  {Bread}                 => {Pastry} 0.029801 0.091333   1.042874 196  
## [8]  {Coffee, Tea}           => {Pastry} 0.004561 0.087719   1.001614  30  
## [9]  {}                      => {Pastry} 0.087578 0.087578   1.000000 576  
## [10] {Alfajores}             => {Pastry} 0.003345 0.082090   0.937332  22  
## [11] {Tea}                   => {Pastry} 0.009275 0.064825   0.740194  61  
## [12] {Cookies}               => {Pastry} 0.003649 0.064171   0.732732  24

For pastry, the lift values are lower than for toast and cake, and confidence is lower than for cake. Nevertheless, a relatively high lift is still evident for medialuna with pastry (1.60) and for bread and coffee with pastry (1.44).

5 Conclusion

The analysis of bakery sales data using Association Rules yielded noteworthy results. The data are somewhat unusual in that coffee, not a baked product, is the most popular item at this bakery. Even so, I was surprised by how completely coffee dominated the Association Rules.

The second most popular product, bread, showed almost no positive association with other products. This finding suggests that the bakery’s clientele comprises distinct segments, with customers who buy coffee together with other products being a potential target audience for bundled offers.

In conclusion, Association Rules offer a compelling approach to understanding consumer behaviour from market basket data. The findings indicate that promotional and bundled offers, such as Toast and Coffee, could be viable ways to increase the bakery’s revenue. The analysis also suggests leaving bread out of mixed-product offers.