Market basket analysis is a common practical application of unsupervised learning methods in business. It identifies the most popular combinations of products by producing rules such as "if a client purchases X, they will likely also purchase Y". This paper presents an overview of the most popular association rule mining methods, using the sales of a bakery as an example.
The dataset was downloaded from Kaggle and describes the sales of a bakery. The original data is structured as follows:
df <- read.csv("Bakery.csv")  # load the raw data: one purchased item per row
head(df)
## TransactionNo Items DateTime Daypart DayType
## 1 1 Bread 2016-10-30 09:58:11 Morning Weekend
## 2 2 Scandinavian 2016-10-30 10:05:34 Morning Weekend
## 3 2 Scandinavian 2016-10-30 10:05:34 Morning Weekend
## 4 3 Hot chocolate 2016-10-30 10:07:57 Morning Weekend
## 5 3 Jam 2016-10-30 10:07:57 Morning Weekend
## 6 3 Cookies 2016-10-30 10:07:57 Morning Weekend
Each row represents a single purchased item, together with the transaction it was a part of and the time at which the transaction took place. It is useful to verify the number of transactions and the number of unique items that the bakery offers:
print("Number of transactions:")
## [1] "Number of transactions:"
print(length(unique(df$TransactionNo)))
## [1] 9465
print("Number of unique items:")
## [1] "Number of unique items:"
print(length(unique(df$Items)))
## [1] 94
The arules library allows for an easy and efficient transformation of a dataset organized in this way (one item per row rather than one transaction per row):
library(arules)
trans <- read.transactions("Bakery.csv", format="single", sep=",",
                           cols=c("TransactionNo","Items"), header=TRUE)
itemFrequencyPlot(trans, topN=10, type="relative", main="Items Frequency", cex.names=0.8)
The plot above shows the relative purchase frequency of the 10 most popular items. The most popular items are coffee and bread; the prominence of coffee may be surprising, as it is often not offered in bakeries. We can reason, however, that due to the characteristics of the region's culture or of the business itself, this establishment can be treated as a hybrid between a bakery and a cafe.
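The exact frequencies behind the plot can also be read off numerically; a small sketch using itemFrequency:
# Relative frequency (share of transactions) of the 10 most common items
head(sort(itemFrequency(trans), decreasing=TRUE), 10)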
Another interesting thing to analyze is how many items a typical transaction contains:
hist(size(trans), breaks=20, xaxt="n",
     main="Number of items in particular transaction", xlab="Items")
axis(1, at=seq(0, 20, by=1), cex.axis=0.8)
Most of the transactions contain only one or two items, which may prove problematic for the basket analysis, as few associations can be formed between the items.
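This can be quantified directly; a small sketch:
# Share of baskets with at most two items, and the full size distribution
mean(size(trans) <= 2)
table(size(trans))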
# Eclat
First, the data will be analyzed with the eclat algorithm. Its goal is to identify the most frequently occurring baskets based on their support, i.e. the number of transactions containing a given basket divided by the total number of transactions analyzed.
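The output below stems from a call along these lines (a reconstruction; the original chunk is not shown, but the parameter values match the printed specification: support of 0.05, maximum basket length of 9):
freq.items <- eclat(trans, parameter=list(supp=0.05, maxlen=9))
inspect(head(freq.items))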
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.05 1 9 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 328
##
## create itemset ...
## set transactions ...[102 item(s), 6576 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating bit matrix ... [9 row(s), 6576 column(s)] done [0.00s].
## writing ... [12 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
## items support count
## [1] {Cake, Coffee} 0.05687348 374
## [2] {Coffee, Tea} 0.05200730 342
## [3] {Bread, Coffee} 0.09032847 594
## [4] {Coffee} 0.48479319 3188
## [5] {Bread} 0.32633820 2146
## [6] {Tea} 0.14309611 941
The results of the eclat analysis are as expected: the basket with the highest support contains only coffee, which could be predicted from the initial analysis of the dataset. What is interesting is that cake, tea and bread are also often purchased alongside coffee. Probably the least informative pair is bread with coffee, as these are the two most popular items overall, but it still tells the owners that they may want to offer these products as a bundle. To further demonstrate the utility of this information, it is useful to look at the rules induced from the frequent itemsets:
rules <- ruleInduction(freq.items, trans, confidence=0.25)
inspect(rules)
## lhs rhs support confidence lift itemset
## [1] {Cake} => {Coffee} 0.05687348 0.5389049 1.1116181 1
## [2] {Tea} => {Coffee} 0.05200730 0.3634431 0.7496870 2
## [3] {Bread} => {Coffee} 0.09032847 0.2767940 0.5709528 3
Only rules with a confidence of 25% or more are shown. A rule means that if a customer purchases the item in the lhs column, they are more likely to also purchase the one in the rhs column. The lift column shows how much more often the items occur together than would be expected if they were purchased independently. Here, the only lift above 1 is for cake and coffee, meaning these items are bought together about 11% more often than independence would suggest. Moreover, if a cake was purchased, the confidence tells us that there is an almost 54% chance that coffee will also be purchased.
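Lift can be recomputed by hand as a sanity check, since it is simply the rule's confidence divided by the support of the RHS item; a minimal sketch using the objects defined above:
# lift(lhs => rhs) = confidence(lhs => rhs) / support(rhs)
# e.g. for {Cake} => {Coffee}: 0.5389049 / 0.48479319 = 1.1116181
quality(rules)$confidence / itemFrequency(trans)["Coffee"]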
# Apriori
The second algorithm is apriori. It searches for associations in a bottom-up, level-wise manner: frequent individual items are found first, then each frequent itemset is extended by one item at a time, and a candidate is kept only if it still meets the support threshold. This relies on the fact that every subset of a frequent itemset must itself be frequent, so unpromising branches can be pruned early.
rules.trans<-apriori(trans, parameter=list(supp=0.01, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 65
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[102 item(s), 6576 transaction(s)] done [0.00s].
## sorting and recoding items ... [30 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [12 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules.by.conf <- sort(rules.trans, by="lift", decreasing=TRUE)  # note: sorted by lift, despite the variable name
inspect(head(rules.by.conf))
## lhs rhs support confidence coverage lift
## [1] {Toast} => {Coffee} 0.02585158 0.7296137 0.03543187 1.505000
## [2] {Spanish Brunch} => {Coffee} 0.01414234 0.6326531 0.02235401 1.304996
## [3] {Medialuna} => {Coffee} 0.03315085 0.5751979 0.05763382 1.186481
## [4] {Sandwich} => {Coffee} 0.04257908 0.5679513 0.07496959 1.171533
## [5] {Pastry} => {Coffee} 0.04896594 0.5590278 0.08759124 1.153126
## [6] {Alfajores} => {Coffee} 0.02250608 0.5522388 0.04075426 1.139122
## count
## [1] 170
## [2] 93
## [3] 218
## [4] 280
## [5] 322
## [6] 148
Oftentimes the apriori algorithm yields similar results to eclat; here, however, in the interest of exploring different aspects of the dataset, I have decided to pass different requirements for the resulting associations. The required support level was lowered to 1%, and the confidence requirement was raised to 50%. The rules obtained are significantly different, as items which appear less often in the dataset, such as toast and Spanish brunch, and therefore have lower support values, are now allowed to occur.
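How strongly the rule count reacts to these thresholds can be probed directly; an illustrative sketch (verbose output suppressed):
# Number of rules under a stricter support vs. a looser confidence requirement
length(apriori(trans, parameter=list(supp=0.05, conf=0.5), control=list(verbose=FALSE)))
length(apriori(trans, parameter=list(supp=0.01, conf=0.25), control=list(verbose=FALSE)))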
Let's take a look at all of the identified rules:
inspect(rules.by.conf)
## lhs rhs support confidence coverage lift
## [1] {Toast} => {Coffee} 0.02585158 0.7296137 0.03543187 1.505000
## [2] {Spanish Brunch} => {Coffee} 0.01414234 0.6326531 0.02235401 1.304996
## [3] {Medialuna} => {Coffee} 0.03315085 0.5751979 0.05763382 1.186481
## [4] {Sandwich} => {Coffee} 0.04257908 0.5679513 0.07496959 1.171533
## [5] {Pastry} => {Coffee} 0.04896594 0.5590278 0.08759124 1.153126
## [6] {Alfajores} => {Coffee} 0.02250608 0.5522388 0.04075426 1.139122
## [7] {Tiffin} => {Coffee} 0.01064477 0.5468750 0.01946472 1.128058
## [8] {Scone} => {Coffee} 0.01855231 0.5422222 0.03421533 1.118461
## [9] {Cake} => {Coffee} 0.05687348 0.5389049 0.10553528 1.111618
## [10] {Juice} => {Coffee} 0.02144161 0.5300752 0.04045012 1.093405
## [11] {Cookies} => {Coffee} 0.02995742 0.5267380 0.05687348 1.086521
## [12] {Hot chocolate} => {Coffee} 0.02737226 0.5263158 0.05200730 1.085650
## count
## [1] 170
## [2] 93
## [3] 218
## [4] 280
## [5] 322
## [6] 148
## [7] 70
## [8] 122
## [9] 374
## [10] 141
## [11] 197
## [12] 180
To extract only the statistically significant rules, Fisher's exact test is used:
is.significant(rules.by.conf, trans)
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
It turns out that only the first 5 rules are significant, but in order to properly demonstrate the visualization tools for association rule mining, the remaining rules won't be excluded from the rule set.
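Had we wanted to keep only the significant rules, the rule set can be subset with the logical vector returned above; a minimal sketch:
# Keep only the rules that pass Fisher's exact test
rules.sig <- rules.by.conf[is.significant(rules.by.conf, trans)]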
plot(rules.by.conf, method="grouped")
The plot above visualizes the results of the analysis by showing the support and lift of each of the found rules. Below is another way to visualize them:
plot(rules.by.conf, method="graph", shading="lift")
Of course, since all of the rules have the same shape, with Coffee as the RHS item in each of them, these plots are not particularly informative here; for other datasets, however, they may be of great use.
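arulesViz offers further visualizations as well; for instance, the default scatter plot places every rule in the support/confidence plane, shaded by lift (a quick sketch):
# Scatter plot of the rules: support vs. confidence, shaded by lift
plot(rules.by.conf, method="scatterplot")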
Below, I have tried to identify rules with a different item on the RHS that still have decent support. My first idea was to try the second most frequently occurring item: bread.
rules.bread <- apriori(data=trans, parameter=list(supp=0.01, conf=0.3, target="rules"),
                       appearance=list(default="lhs", rhs="Bread"),
                       control=list(verbose=F))
rules.bread.byconf <- sort(rules.bread, by="confidence", decreasing=TRUE)
inspect(head(rules.bread.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Pastry} => {Bread} 0.02980535 0.3402778 0.08759124 1.042715 196
## [2] {} => {Bread} 0.32633820 0.3263382 1.00000000 1.000000 2146
The analysis yielded only one non-trivial rule, between pastry and bread (the second entry, with an empty LHS, is the trivial rule stating the base frequency of bread). Its lift is barely above 1 and its support is low, so the rule does not carry much business value. The arules library also allows for visualization of such individual rule sets, since they are stored in the same format as the combined rules; in this case, as the quality of the rule is poor, the resulting plot is not very readable either.
plot(rules.bread, method="graph", shading="lift")
# Jaccard index
Another way of identifying how items are related to each other is the Jaccard index, which measures how often two items are bought together relative to how often at least one of them is bought. The dissimilarity() function returns 1 minus this index, so lower values indicate items that co-occur more often. The output is a dissimilarity matrix which looks as follows:
trans.sel<-trans[,itemFrequency(trans)>0.05]
d.jac.i<-dissimilarity(trans.sel, which="items")
round(d.jac.i,2)
## Bread Cake Coffee Cookies Hot chocolate Medialuna Pastry Sandwich
## Cake 0.94
## Coffee 0.87 0.89
## Cookies 0.96 0.95 0.94
## Hot chocolate 0.97 0.93 0.95 0.95
## Medialuna 0.96 0.97 0.93 0.98 0.96
## Pastry 0.92 0.97 0.91 0.97 0.96 0.94
## Sandwich 0.96 0.96 0.92 0.98 0.97 0.98 0.99
## Tea 0.93 0.88 0.91 0.95 0.96 0.96 0.96 0.93
From the output we can read that the most closely related pairs (those with the lowest dissimilarity) are coffee+bread, cake+tea and coffee+cake. Cake with tea is a reasonable pairing; the pairs involving coffee, however, are probably less reliable, as both coffee and bread are extremely popular on their own and therefore co-occur with many items simply by virtue of their frequency.
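These values can be verified by hand from the pairwise co-occurrence counts; for bread and coffee, a minimal sketch using arules::crossTable (the counts match those reported by eclat above):
# Jaccard dissimilarity: 1 - |A and B| / |A or B|
ct <- crossTable(trans, measure="count")    # co-occurrence counts; diagonal = single-item counts
n.both   <- ct["Bread", "Coffee"]           # transactions containing both items (594)
n.bread  <- ct["Bread", "Bread"]            # transactions containing Bread (2146)
n.coffee <- ct["Coffee", "Coffee"]          # transactions containing Coffee (3188)
1 - n.both / (n.bread + n.coffee - n.both)  # ~0.87, matching the matrix above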