The objective of this paper is to apply association rules to analyse bakery basket sales data. The data set analysed in this article is sourced from Kaggle (https://www.kaggle.com/datasets/akashdeepkuila/bakery). It originates from “The Bread Basket,” a bakery situated in Edinburgh, and contains transaction details of customers who ordered items from this bakery between October 30, 2016, and April 9, 2017.
# Importing the libraries needed for the whole project
library(arules)
library(arulesViz)
library(latex2exp)
The dataset comprises over 6,000 transactions. Two of its columns, TransactionNo and Items, are retained for the analysis:
bakery <- read.csv("Bakery.csv")[, c("TransactionNo", "Items")]
write.csv(bakery, "filtered_bakery.csv", row.names = FALSE, quote = FALSE)
bakery_transactions <- read.transactions(
"filtered_bakery.csv",
format = "single",
cols = c(1, 2),
sep = ","
)
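To make the `format = "single"` layout more concrete, the sketch below groups a long table (one row per purchased item) into baskets using base R only. The rows are invented for illustration and are not taken from the Kaggle file; `toy` and `baskets` are hypothetical names.

```r
# Toy long-format data: one row per (transaction, item) pair.
# ASSUMPTION: these rows are made up for illustration only.
toy <- data.frame(
  TransactionNo = c(1, 2, 2, 3, 3, 3),
  Items = c("Bread", "Coffee", "Cake", "Coffee", "Toast", "Coffee"),
  stringsAsFactors = FALSE
)

# Group item names by transaction number and treat each basket as a set,
# so an item repeated within one transaction is kept only once.
baskets <- lapply(split(toy$Items, toy$TransactionNo), unique)

print(baskets)
```

Each element of `baskets` corresponds conceptually to one row of the transaction listing: a transaction identifier and the set of items bought under it.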
The following example illustrates the data.
inspect(head(bakery_transactions, 20))
## items transactionID
## [1] {Bread} 1
## [2] {Medialuna, Scandinavian} 10
## [3] {Chimichurri Oil, Scandinavian} 1000
## [4] {Bread, Truffles} 1001
## [5] {Brownie, Focaccia} 1002
## [6] {Bread, Coffee} 1003
## [7] {Art Tray, Coffee, Cookies, Tea} 1004
## [8] {Coffee} 1005
## [9] {Bread} 1006
## [10] {Alfajores, Coffee, Coke} 1007
## [11] {Bread} 1008
## [12] {Bread} 1009
## [13] {Bread, Coffee, Medialuna, Pastry} 1010
## [14] {Coffee} 1011
## [15] {Bread, Keeping It Local} 1012
## [16] {Bread} 1013
## [17] {Coffee, Scandinavian} 1014
## [18] {Bread, Farm House, Medialuna, Pastry} 1015
## [19] {Medialuna} 1016
## [20] {Coffee} 1017
The following plot provides a visual representation of the frequency with which specific items appear in transactions.
itemFrequencyPlot(bakery_transactions, topN=25, type="relative", main="Item Frequency")
Although the data come from a bakery, coffee is the most frequently purchased product. Bread is second, followed by tea, cake, pastry and sandwich. The plot suggests that most associations will involve coffee or bread, although item frequencies alone do not show which items are bought together.
Association rule learning is a rule-based machine learning technique that identifies meaningful relationships between variables in large databases. Its purpose is to uncover strong rules within datasets by applying specific measures of significance. In transactions involving various items, association rules aim to reveal which items tend to occur together.
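The relative frequency shown in such a plot is the share of transactions that contain a given item. The following is a minimal base R sketch of that calculation on four invented baskets; the data and the names `baskets`, `items` and `rel_freq` are assumptions for illustration, not part of the original analysis.

```r
# ASSUMPTION: four made-up baskets, used only to illustrate the calculation.
baskets <- list(
  c("Bread"),
  c("Coffee", "Cake"),
  c("Coffee", "Toast"),
  c("Bread", "Coffee")
)

items <- unique(unlist(baskets))

# For each item, the proportion of baskets in which it appears.
rel_freq <- vapply(
  items,
  function(item) mean(vapply(baskets, function(b) item %in% b, logical(1))),
  numeric(1)
)

# Most frequent item first, as in an item frequency plot.
rel_freq <- sort(rel_freq, decreasing = TRUE)
print(rel_freq)
```

Here coffee appears in three of the four baskets (0.75) and bread in two (0.5), so coffee would be the tallest bar.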
Support is defined as the proportion of transactions in the dataset that contain both the antecedent (if part) and the consequent (then part) of a rule. It is a metric that indicates how frequently the rule occurs in the dataset. High support signifies that the rule is common and may be relevant.
\[ Support(A \Rightarrow B) = \frac{\text{Transactions containing A and B}}{\text{Total number of transactions}} \]
Confidence is the probability that the consequent occurs in transactions where the antecedent is present. It measures the reliability of the rule. A high confidence value suggests a strong relationship between the antecedent and the consequent.
\[ Confidence(A \Rightarrow B) = \frac{\text{Transactions containing A and B}}{\text{Transactions containing A}} \]
Lift is the ratio of the observed frequency of the rule to the frequency expected if the antecedent and consequent were independent. It indicates how much more likely the consequent is to occur when the antecedent is present, compared with its overall frequency in the dataset. Lift > 1 suggests a positive association, lift < 1 a negative one, and lift = 1 independence.
\[ Lift(A \Rightarrow B) = \frac{Confidence(A \Rightarrow B)}{Support(B)} \]
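As a worked example of the three formulas, the sketch below recomputes the measures for the rule {Toast} => {Coffee} from raw counts. The total and the joint count are taken from the output reported later in this article; the separate counts for toast and coffee are not printed anywhere and are back-calculated here from the reported confidence and lift, so they should be read as assumptions.

```r
# Counts for the rule {Toast} => {Coffee}.
n_total  <- 6577  # transactions, from the Apriori log
n_both   <- 170   # transactions with toast and coffee, from the rule table
n_toast  <- 233   # ASSUMPTION: back-calculated as count / confidence
n_coffee <- 3188  # ASSUMPTION: back-calculated from confidence / lift

support    <- n_both / n_total                    # share of all transactions
confidence <- n_both / n_toast                    # share of toast transactions
lift       <- confidence / (n_coffee / n_total)   # confidence / Support(Coffee)

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

Under these assumed counts the rounded values are 0.0258, 0.7296 and 1.5052, which agree with the figures reported for this rule.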
The analysis begins by generating the rules with the Apriori algorithm. With the default thresholds for support and confidence (0.1 and 0.8, respectively), the algorithm was unable to identify any rules. Consequently, the thresholds were reduced to 0.005 for support and 0.5 for confidence.
rules<-apriori(bakery_transactions, parameter=list(supp=0.005, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 32
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[103 item(s), 6577 transaction(s)] done [0.00s].
## sorting and recoding items ... [37 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
A total of 17 rules were identified, and these will be subjected to further analysis. However, it is important to note that the initial parameters were significantly reduced.
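To put the reduction into perspective, the relative support thresholds can be translated into absolute transaction counts. The sketch below does this for the 6,577 transactions reported in the log. Truncating the product reproduces the "Absolute minimum support count: 32" line, but the use of truncation is an assumption about how that figure is derived, not a documented fact.

```r
n_transactions  <- 6577   # from the Apriori log
default_support <- 0.1    # default threshold, which yielded no rules
chosen_support  <- 0.005  # threshold used in this analysis

# ASSUMPTION: the absolute count is the truncated product.
default_min_count <- as.integer(default_support * n_transactions)
chosen_min_count  <- as.integer(chosen_support * n_transactions)

print(c(default = default_min_count, chosen = chosen_min_count))
```

With the default threshold an itemset would need to appear in roughly 657 transactions, whereas the chosen threshold requires only about 32, which is why small-count rules should be interpreted with care.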
The ten rules with the highest support are presented first.
rules.by.supp <- sort(rules, by = "support", decreasing = TRUE)
rules.by.supp@quality$coverage <- NULL
inspect(head(rules.by.supp, 10))
## lhs rhs support confidence lift count
## [1] {Cake} => {Coffee} 0.05686483 0.5389049 1.111787 374
## [2] {Pastry} => {Coffee} 0.04895849 0.5590278 1.153302 322
## [3] {Sandwich} => {Coffee} 0.04257260 0.5679513 1.171711 280
## [4] {Medialuna} => {Coffee} 0.03314581 0.5751979 1.186661 218
## [5] {Cookies} => {Coffee} 0.02995287 0.5267380 1.086686 197
## [6] {Hot chocolate} => {Coffee} 0.02736810 0.5263158 1.085815 180
## [7] {Toast} => {Coffee} 0.02584765 0.7296137 1.505229 170
## [8] {Alfajores} => {Coffee} 0.02250266 0.5522388 1.139296 148
## [9] {Juice} => {Coffee} 0.02143835 0.5300752 1.093571 141
## [10] {Scone} => {Coffee} 0.01854949 0.5422222 1.118631 122
The results indicate that cake and coffee is the most frequent combination, appearing in 5.69% of all transactions. It is followed by pastry and coffee (4.90%) and sandwich and coffee (4.26%). All ten rules consist of two products, one of which is coffee. Bread, despite being the second most popular product, does not appear in any of them.
Next, the rules are sorted by confidence.
rules.by.conf<-sort(rules, by="confidence", decreasing=TRUE)
rules.by.conf@quality$coverage <- NULL
rules.by.conf@quality$support <- round(rules.by.conf@quality$support, 6)
rules.by.conf@quality$confidence <- round(rules.by.conf@quality$confidence, 6)
rules.by.conf@quality$lift <- round(rules.by.conf@quality$lift, 6)
inspect(head(rules.by.conf,10))
## lhs rhs support confidence lift count
## [1] {Cake, Sandwich} => {Coffee} 0.005626 0.755102 1.557812 37
## [2] {Toast} => {Coffee} 0.025848 0.729614 1.505229 170
## [3] {Spanish Brunch} => {Coffee} 0.014140 0.632653 1.305194 93
## [4] {Cake, Hot chocolate} => {Coffee} 0.006538 0.632353 1.304575 43
## [5] {Salad} => {Coffee} 0.007906 0.611765 1.262101 52
## [6] {Medialuna} => {Coffee} 0.033146 0.575198 1.186661 218
## [7] {Sandwich} => {Coffee} 0.042573 0.567951 1.171711 280
## [8] {Pastry} => {Coffee} 0.048958 0.559028 1.153302 322
## [9] {Alfajores} => {Coffee} 0.022503 0.552239 1.139296 148
## [10] {Tiffin} => {Coffee} 0.010643 0.546875 1.128230 70
Every rule has coffee as its consequent. The rule with the highest confidence is {Cake, Sandwich} => {Coffee}, at 75.51%: in over 75% of transactions containing both cake and a sandwich, coffee was also purchased. However, this rule rests on only 37 transactions. Toast and coffee, by contrast, combine high confidence with a much larger count: the pair was observed in 170 transactions, 72.96% of all transactions involving toast. The pairings of sandwich with coffee and pastry with coffee are also frequent, and for both products the probability of being accompanied by coffee is between 56% and 57%.
Finally, the rules are sorted by lift.
rules.by.lift<-sort(rules, by="lift", decreasing=TRUE)
rules.by.lift@quality$coverage <- NULL
rules.by.lift@quality$support <- round(rules.by.lift@quality$support, 6)
rules.by.lift@quality$confidence <- round(rules.by.lift@quality$confidence, 6)
rules.by.lift@quality$lift <- round(rules.by.lift@quality$lift, 6)
inspect(head(rules.by.lift,10 ))
## lhs rhs support confidence lift count
## [1] {Cake, Sandwich} => {Coffee} 0.005626 0.755102 1.557812 37
## [2] {Toast} => {Coffee} 0.025848 0.729614 1.505229 170
## [3] {Spanish Brunch} => {Coffee} 0.014140 0.632653 1.305194 93
## [4] {Cake, Hot chocolate} => {Coffee} 0.006538 0.632353 1.304575 43
## [5] {Salad} => {Coffee} 0.007906 0.611765 1.262101 52
## [6] {Medialuna} => {Coffee} 0.033146 0.575198 1.186661 218
## [7] {Sandwich} => {Coffee} 0.042573 0.567951 1.171711 280
## [8] {Pastry} => {Coffee} 0.048958 0.559028 1.153302 322
## [9] {Alfajores} => {Coffee} 0.022503 0.552239 1.139296 148
## [10] {Tiffin} => {Coffee} 0.010643 0.546875 1.128230 70
Because every rule has the same consequent, lift is simply confidence divided by the support of coffee, so the ranking is identical to the ranking by confidence. Transactions containing both cake and a sandwich are about 56% more likely to include coffee than a transaction chosen at random (lift 1.56). This association points to potential for combination deals or advertisements, although it rests on only 37 transactions. The rule {Toast} => {Coffee}, with a lift of 1.51, similarly suggests that toast is commonly purchased with coffee, reinforcing its role as a complementary product. Lift values of around 1.3 are observed for Spanish brunch with coffee, cake and hot chocolate with coffee, and salad with coffee. As in the previous tables, coffee is present in every rule.
The first visualisation represents the rules as a network, with lines signifying relationships. The colour of each dot indicates the strength of the rule in terms of lift, and its size indicates the strength in terms of support.
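The table sorted by lift lists the rules in the same order as the table sorted by confidence. A short check, sketched below with the top five confidence values copied from the tables, shows why: when all rules share one consequent, lift is confidence divided by a constant. The support of coffee is not printed directly, so it is back-calculated here from the toast rule, which is an assumption.

```r
# Confidence values of the top five rules, copied from the tables above.
confidence <- c(
  "Cake, Sandwich"      = 0.755102,
  "Toast"               = 0.729614,
  "Spanish Brunch"      = 0.632653,
  "Cake, Hot chocolate" = 0.632353,
  "Salad"               = 0.611765
)
reported_lift <- c(1.557812, 1.505229, 1.305194, 1.304575, 1.262101)

# ASSUMPTION: Support(Coffee) recovered as confidence / lift of the toast rule.
coffee_support <- 0.729614 / 1.505229

# With a common consequent, lift is confidence scaled by one constant.
lift <- confidence / coffee_support

same_order <- identical(
  order(confidence, decreasing = TRUE),
  order(lift, decreasing = TRUE)
)
print(round(lift, 4))
print(same_order)
```

Dividing by a positive constant cannot change a ranking, so sorting by lift only becomes informative once rules with different consequents are compared.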
plot(rules, method="graph")
The graph confirms that coffee is the focal point of this bakery’s sales: every rule points to coffee, in line with the tables above.
The parallel coordinates plot below gives another confirmation of what the rules look like. All of them lead to coffee, and the plot makes it easy to spot the few rules with a two-item antecedent, in which cake is combined with hot chocolate or with a sandwich.
plot(rules, method="paracoord")
All of the general rules above have coffee as the consequent. This section therefore examines rules whose consequent is bread, toast, cake or pastry. For each product, the support and confidence thresholds were adjusted so that roughly 10 rules are displayed.
rules.Bread <- apriori(data = bakery_transactions, parameter = list(supp = 0.01, conf = 0.05, target = "rules"), appearance = list(default = "lhs", rhs = "Bread"), control = list(verbose = FALSE))
rules.Bread.byconf <- sort(rules.Bread, by="confidence", decreasing=TRUE)
rules.Bread.byconf@quality$support <- round(rules.Bread.byconf@quality$support, 6)
rules.Bread.byconf@quality$confidence <- round(rules.Bread.byconf@quality$confidence, 6)
rules.Bread.byconf@quality$lift <- round(rules.Bread.byconf@quality$lift, 6)
rules.Bread.byconf@quality$coverage <- NULL
inspect(rules.Bread.byconf)
## lhs rhs support confidence lift count
## [1] {Pastry} => {Bread} 0.029801 0.340278 1.042874 196
## [2] {} => {Bread} 0.326289 0.326289 1.000000 2146
## [3] {Medialuna} => {Bread} 0.016421 0.284960 0.873339 108
## [4] {Alfajores} => {Bread} 0.011251 0.276119 0.846243 74
## [5] {Brownie} => {Bread} 0.011860 0.268966 0.824318 78
## [6] {Cookies} => {Bread} 0.015205 0.267380 0.819458 100
## [7] {Coffee, Pastry} => {Bread} 0.011403 0.232919 0.713844 75
## [8] {Hot chocolate} => {Bread} 0.012012 0.230994 0.707944 79
## [9] {Sandwich} => {Bread} 0.017029 0.227181 0.696256 112
## [10] {Cake} => {Bread} 0.023415 0.221902 0.680079 154
## [11] {Tea} => {Bread} 0.029649 0.207226 0.635101 195
## [12] {Coffee} => {Bread} 0.090315 0.186324 0.571040 594
Among the rules for bread, the only one with a lift above 1 is {Pastry} => {Bread}, and only marginally (1.04). Every other antecedent makes bread less likely than its baseline: for example, coffee buyers include bread in 18.6% of their transactions, against 32.6% overall. The rule with the empty antecedent, {} => {Bread}, simply states that bread appears in 32.6% of all transactions (2,146); it does not mean that bread was bought on its own. This explains why bread, although the second most commonly purchased product, did not appear in the general rules above. For this bakery, the rules therefore provide little support for bundles or combination offers featuring bread.
rules.Toast <- apriori(data = bakery_transactions, parameter = list(supp = 0.001, conf = 0.05, target = "rules"), appearance = list(default = "lhs", rhs = "Toast"), control = list(verbose = FALSE))
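The rule with the empty antecedent, {} => {Bread}, gives the baseline share of transactions that contain bread, and every lift in the bread table is the rule's confidence divided by that baseline. The sketch below recomputes this from the confidence column, using values copied from the table; the variable names are hypothetical.

```r
# Confidence of each rule with consequent Bread, copied from the table above.
confidence <- c(
  "Pastry" = 0.340278, "(empty)" = 0.326289, "Medialuna" = 0.284960,
  "Alfajores" = 0.276119, "Brownie" = 0.268966, "Cookies" = 0.267380,
  "Coffee, Pastry" = 0.232919, "Hot chocolate" = 0.230994,
  "Sandwich" = 0.227181, "Cake" = 0.221902, "Tea" = 0.207226,
  "Coffee" = 0.186324
)

# The empty-antecedent rule is the baseline: the share of all
# transactions that contain bread.
baseline <- confidence[["(empty)"]]

lift <- confidence / baseline

# Antecedents that make bread more likely than the baseline.
above_baseline <- names(lift)[lift > 1]
print(round(lift, 3))
print(above_baseline)
```

Only pastry ends up above 1; for every other antecedent, bread is less likely than in a transaction chosen at random.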
rules.Toast.byconf <- sort(rules.Toast, by="confidence", decreasing=TRUE)
rules.Toast.byconf@quality$support <- round(rules.Toast.byconf@quality$support, 6)
rules.Toast.byconf@quality$confidence <- round(rules.Toast.byconf@quality$confidence, 6)
rules.Toast.byconf@quality$lift <- round(rules.Toast.byconf@quality$lift, 6)
rules.Toast.byconf@quality$coverage <- NULL
inspect(rules.Toast.byconf)
## lhs rhs support confidence lift count
## [1] {Bread, Hot chocolate} => {Toast} 0.001064 0.088608 2.501168 7
## [2] {Coffee, Juice} => {Toast} 0.001672 0.078014 2.202143 11
## [3] {Coffee, Hot chocolate} => {Toast} 0.001825 0.066667 1.881831 12
## [4] {Hot chocolate} => {Toast} 0.003193 0.061404 1.733266 21
## [5] {Juice} => {Toast} 0.002433 0.060150 1.697893 16
## [6] {Coffee, Tea} => {Toast} 0.003041 0.058480 1.650729 20
## [7] {Spanish Brunch} => {Toast} 0.001216 0.054422 1.536189 8
## [8] {Coffee} => {Toast} 0.025848 0.053325 1.505229 170
## [9] {Bread, Coffee} => {Toast} 0.004561 0.050505 1.425630 30
The situation for toast differs markedly from that for bread. Although support is low for every rule, the lift values in the table above are high. For example, when a transaction contains bread and hot chocolate, the probability that it also contains toast is about 8.9%, roughly 2.5 times the baseline of about 3.5%. Confidence remains low in absolute terms and the counts are small (the top rule rests on only 7 transactions), but the elevated lift suggests that bundles including toast could appeal to customers.
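Because several of the toast rules rest on very few transactions, it can help to look at lift and count side by side. The sketch below rebuilds the table as a data frame and keeps only rules seen at least 10 times. The cut-off `min_count` is a hypothetical choice made for illustration and is not part of the original analysis.

```r
# Toast rules copied from the table above (antecedent, lift, count).
toast_rules <- data.frame(
  lhs = c("Bread, Hot chocolate", "Coffee, Juice", "Coffee, Hot chocolate",
          "Hot chocolate", "Juice", "Coffee, Tea", "Spanish Brunch",
          "Coffee", "Bread, Coffee"),
  confidence = c(0.088608, 0.078014, 0.066667, 0.061404, 0.060150,
                 0.058480, 0.054422, 0.053325, 0.050505),
  lift = c(2.501168, 2.202143, 1.881831, 1.733266, 1.697893,
           1.650729, 1.536189, 1.505229, 1.425630),
  count = c(7, 11, 12, 21, 16, 20, 8, 170, 30),
  stringsAsFactors = FALSE
)

# Baseline share of transactions containing toast, recovered as
# confidence / lift (the same for every row up to rounding).
baseline <- toast_rules$confidence[1] / toast_rules$lift[1]

# ASSUMPTION: 10 transactions is an arbitrary illustrative cut-off.
min_count <- 10
robust <- toast_rules[toast_rules$count >= min_count, ]

print(round(baseline, 4))
print(robust[, c("lhs", "lift", "count")])
```

With this cut-off, seven of the nine rules remain; the rule with the highest lift, {Bread, Hot chocolate} => {Toast}, drops out because it was observed only 7 times.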
plot(rules.Toast, method="paracoord")
The parallel coordinates plot shows that five of the nine rules have a two-item antecedent, i.e. involve three items in total. This is particularly relevant for promotional bundles.
rules.Cake <- apriori(data = bakery_transactions, parameter = list(supp = 0.005, conf = 0.05, target = "rules"), appearance = list(default = "lhs", rhs = "Cake"), control = list(verbose = FALSE))
rules.Cake.byconf <- sort(rules.Cake, by="confidence", decreasing=TRUE)
rules.Cake.byconf@quality$support <- round(rules.Cake.byconf@quality$support, 6)
rules.Cake.byconf@quality$confidence <- round(rules.Cake.byconf@quality$confidence, 6)
rules.Cake.byconf@quality$lift <- round(rules.Cake.byconf@quality$lift, 6)
rules.Cake.byconf@quality$coverage <- NULL
inspect(rules.Cake.byconf)
## lhs rhs support confidence lift count
## [1] {Coffee, Hot chocolate} => {Cake} 0.006538 0.238889 2.263937 43
## [2] {Coffee, Tea} => {Cake} 0.011251 0.216374 2.050567 74
## [3] {Hot chocolate} => {Cake} 0.010339 0.198830 1.884305 68
## [4] {Tea} => {Cake} 0.026304 0.183847 1.742308 173
## [5] {Bread, Tea} => {Cake} 0.005170 0.174359 1.652390 34
## [6] {Juice} => {Cake} 0.006994 0.172932 1.638870 46
## [7] {Soup} => {Cake} 0.005170 0.146552 1.388863 34
## [8] {Cookies} => {Cake} 0.007602 0.133690 1.266971 50
## [9] {Coffee, Sandwich} => {Cake} 0.005626 0.132143 1.252311 37
## [10] {Coffee} => {Cake} 0.056865 0.117315 1.111787 374
## [11] {Bread, Coffee} => {Cake} 0.009731 0.107744 1.021085 64
## [12] {} => {Cake} 0.105519 0.105519 1.000000 694
## [13] {Sandwich} => {Cake} 0.007450 0.099391 0.941928 49
## [14] {Bread} => {Cake} 0.023415 0.071761 0.680079 154
The situation for cake is similar to that for toast. However, the support values are somewhat higher, and the pairing of tea and cake (173 transactions, lift 1.74) looks like the most promising candidate for a special offer. The rule with the empty antecedent shows that cake appears in 694 transactions, a support of just over 10%; this is the baseline frequency of cake, not the number of cakes bought on their own.
plot(rules.Cake, method="paracoord")
The parallel coordinates plot resembles the one for toast: rules involving three items can again be seen.
rules.Pastry <- apriori(data = bakery_transactions, parameter = list(supp = 0.003, conf = 0.05, target = "rules"), appearance = list(default = "lhs", rhs = "Pastry"), control = list(verbose = FALSE))
rules.Pastry.byconf <- sort(rules.Pastry, by="confidence", decreasing=TRUE)
rules.Pastry.byconf@quality$support <- round(rules.Pastry.byconf@quality$support, 6)
rules.Pastry.byconf@quality$confidence <- round(rules.Pastry.byconf@quality$confidence, 6)
rules.Pastry.byconf@quality$lift <- round(rules.Pastry.byconf@quality$lift, 6)
rules.Pastry.byconf@quality$coverage <- NULL
inspect(rules.Pastry.byconf)
## lhs rhs support confidence lift count
## [1] {Medialuna} => {Pastry} 0.008058 0.139842 1.596769 53
## [2] {Bread, Coffee} => {Pastry} 0.011403 0.126263 1.441718 75
## [3] {Coffee, Hot chocolate} => {Pastry} 0.003345 0.122222 1.395583 22
## [4] {Coffee, Medialuna} => {Pastry} 0.003953 0.119266 1.361828 26
## [5] {Coffee} => {Pastry} 0.048958 0.101004 1.153302 322
## [6] {Hot chocolate} => {Pastry} 0.004865 0.093567 1.068389 32
## [7] {Bread} => {Pastry} 0.029801 0.091333 1.042874 196
## [8] {Coffee, Tea} => {Pastry} 0.004561 0.087719 1.001614 30
## [9] {} => {Pastry} 0.087578 0.087578 1.000000 576
## [10] {Alfajores} => {Pastry} 0.003345 0.082090 0.937332 22
## [11] {Tea} => {Pastry} 0.009275 0.064825 0.740194 61
## [12] {Cookies} => {Pastry} 0.003649 0.064171 0.732732 24
For pastry, confidence and lift are somewhat lower than for toast and cake. Nevertheless, a relatively high lift is still evident for {Medialuna} => {Pastry} (1.60) and {Bread, Coffee} => {Pastry} (1.44).
The analysis of bakery sales data using association rules yielded noteworthy results. The data are somewhat unusual in that coffee is the most popular product at this bakery. Even so, I was surprised to find that coffee dominated the association rules.
The second most popular product, bread, showed almost no positive association with other products. This finding suggests that the bakery’s clientele comprises distinct segments, with customers who buy coffee together with other products being a potential target audience for bundled offers.
In conclusion, association rules offer a useful approach to understanding consumer behaviour in market basket data. The findings indicate that promotional and bundled offers, such as toast with coffee, may be viable ways of increasing the bakery’s revenue. The analysis also suggests leaving bread out of mixed-product offers.
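The pastry and coffee pair appears in two tables with the same lift (1.153302) but very different confidence: 0.559 for {Pastry} => {Coffee} and 0.101 for {Coffee} => {Pastry}. The sketch below reproduces both directions from counts. The pair and pastry counts come from the tables; the coffee count is not printed and is back-calculated, so it is an assumption.

```r
n_total  <- 6577  # transactions, from the Apriori log
n_both   <- 322   # pastry and coffee together, from the rule tables
n_pastry <- 576   # from the rule {} => {Pastry}
n_coffee <- 3188  # ASSUMPTION: back-calculated from confidence / lift

# Confidence depends on the direction of the rule.
conf_pastry_to_coffee <- n_both / n_pastry
conf_coffee_to_pastry <- n_both / n_coffee

# Lift divides by the support of the consequent, which makes it symmetric.
lift_pastry_to_coffee <- conf_pastry_to_coffee / (n_coffee / n_total)
lift_coffee_to_pastry <- conf_coffee_to_pastry / (n_pastry / n_total)

print(round(c(conf_pastry_to_coffee, conf_coffee_to_pastry), 4))
print(round(c(lift_pastry_to_coffee, lift_coffee_to_pastry), 4))
```

Lift therefore describes the pair rather than a direction, while confidence indicates which item is the better trigger for recommending the other.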