Association rule mining is a set of methods for exploring relationships between items in data. It is highly useful for companies, because it lets them use transaction data to examine consumer behavior, and a better understanding of customers can translate into higher profits. In this paper I look for association rules that are typical for bakery customers. The dataset, which contains transactions from a bakery, was downloaded from https://www.kaggle.com/sulmansarwar/transactions-from-a-bakery/version/1
First, I load the necessary libraries.
library(arules)
library(arulesViz)
library(kableExtra)
A dataset for association rule mining needs to be appropriately prepared. Hence, I use read.transactions with format="single" and specify the 2 columns that identify the basket and the product: “Transaction” and “Item”. Additionally, I check the number of baskets and unique items.
bakery <- read.transactions("Bakery.csv", format="single", sep=",", cols=c("Transaction","Item"), header=TRUE)
summary(size(bakery))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.089 3.000 10.000
cat("Number of baskets:", length(bakery))
## Number of baskets: 6613
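cat("Total number of items across all baskets:", sum(size(bakery)))
## Total number of items across all baskets: 13816
cat("Number of unique items:", length(itemLabels(bakery)))
## Number of unique items: 103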
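To illustrate what format="single" expects, here is a minimal sketch on a hypothetical toy file (the temporary file and its rows are made up for the example and are not taken from Bakery.csv). Each row holds one transaction id and one item, and read.transactions groups the rows into baskets.

```r
library(arules)

# Hypothetical toy data in "single" format: one (transaction id, item) pair per row
tmp <- tempfile(fileext = ".csv")
writeLines(c("Transaction,Item",
             "1,Bread",
             "2,Coffee",
             "2,Cake",
             "3,Coffee",
             "3,Bread"), tmp)

# Column names can be used in cols only when header = TRUE
toy <- read.transactions(tmp, format = "single", sep = ",",
                         cols = c("Transaction", "Item"), header = TRUE)

n_baskets <- length(toy)              # number of baskets
n_entries <- sum(size(toy))           # total number of items across all baskets
n_unique  <- length(itemLabels(toy))  # number of distinct items
```

The last three lines show why the total number of items in all baskets and the number of unique items are different quantities.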
To get acquainted with the data, I plot the absolute and relative item frequencies.
itemFrequencyPlot(bakery, topN = 10, type = "absolute", main = "Item frequency", cex.names = 0.75)
itemFrequencyPlot(bakery, topN = 10, type = "relative", main = "Item frequency", cex.names = 0.75)
The most frequently ordered products are coffee and bread. This is interesting: since the dataset comes from a bakery, one would expect bread, not coffee, to be the most ordered product.
Eclat stands for Equivalence Class Clustering and bottom-up Lattice Traversal. It is used to identify frequent patterns (itemsets) in transaction data. Eclat is often described as a more efficient and scalable alternative to the Apriori algorithm, although which one is faster depends on the data. While Apriori works on the horizontal layout (one row per transaction) and explores itemsets level by level, similar to a breadth-first search of a graph, Eclat works on the vertical layout (for each item, the list of transactions that contain it) and explores itemsets in a depth-first manner. I use the Apriori algorithm in a later part of this project to show the differences in results between these two algorithms. More about the Eclat algorithm: https://www.geeksforgeeks.org/ml-eclat-algorithm/
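The vertical layout can be sketched in a few lines of base R. This is only an illustration of the tid-list intersection step, under the assumption of a hypothetical toy list of baskets; it is not the full depth-first search and not the implementation used by arules.

```r
# Hypothetical toy baskets (not taken from the bakery data)
baskets <- list(
  t1 = c("Coffee", "Bread"),
  t2 = c("Coffee", "Cake"),
  t3 = c("Coffee", "Bread", "Cake"),
  t4 = c("Bread")
)

# Vertical layout: for every item, the ids of the baskets that contain it
items <- sort(unique(unlist(baskets)))
tidlists <- lapply(setNames(items, items), function(item) {
  names(baskets)[vapply(baskets, function(b) item %in% b, logical(1))]
})

# Support of an itemset = size of the intersection of its tid-lists / number of baskets
support_of <- function(itemset) {
  tids <- Reduce(intersect, tidlists[itemset])
  length(tids) / length(baskets)
}

support_of("Coffee")
support_of(c("Bread", "Coffee"))
```

Because the support of a larger itemset comes from intersecting lists that are already in memory, the original transactions do not have to be scanned again for every candidate.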
To follow the results, recall the 3 measures used to evaluate an association rule X ==> Y: support, confidence and lift.
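For reference, the standard definitions of support, confidence and lift can be written as follows. The notation is my own assumption: N is the total number of transactions and count(·) is the number of transactions that contain all items of the given set.

```latex
\begin{align*}
\mathrm{support}(X \Rightarrow Y)    &= \frac{\mathrm{count}(X \cup Y)}{N} \\
\mathrm{confidence}(X \Rightarrow Y) &= \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} \\
\mathrm{lift}(X \Rightarrow Y)       &= \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)}
                                      = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)\,\mathrm{support}(Y)}
\end{align*}
```

A lift above 1 means that X and Y appear together more often than expected if they were independent, while a lift below 1 means they appear together less often.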
rules_e <-eclat(bakery, parameter=list(supp=0.05))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.05 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 330
##
## create itemset ...
## set transactions ...[103 item(s), 6613 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating bit matrix ... [10 row(s), 6613 column(s)] done [0.00s].
## writing ... [13 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
rules_e_i <- inspect(head(sort(rules_e, by = "support"), 15))
kable(rules_e_i, "html") %>% kable_styling("striped")
| | items | support | transIdenticalToItemsets | count |
|---|---|---|---|---|
| [1] | {Coffee} | 0.4820808 | 3188 | 3188 |
| [2] | {Bread} | 0.3245123 | 2146 | 2146 |
| [3] | {Tea} | 0.1422955 | 941 | 941 |
| [4] | {Cake} | 0.1049448 | 694 | 694 |
| [5] | {Bread,Coffee} | 0.0898231 | 594 | 594 |
| [6] | {Pastry} | 0.0871012 | 576 | 576 |
| [7] | {Sandwich} | 0.0745501 | 493 | 493 |
| [8] | {NONE} | 0.0718282 | 475 | 475 |
| [9] | {Medialuna} | 0.0573114 | 379 | 379 |
| [10] | {Cake,Coffee} | 0.0565553 | 374 | 374 |
| [11] | {Cookies} | 0.0565553 | 374 | 374 |
| [12] | {Coffee,Tea} | 0.0517163 | 342 | 342 |
| [13] | {Hot chocolate} | 0.0517163 | 342 | 342 |
Calling eclat() does not create rules. The Eclat algorithm mines frequent itemsets, i.e. the sets of products whose support reaches the chosen threshold (here 0.05, which corresponds to at least 330 baskets), and reports the measures computed for them. Looking at support, coffee and bread clearly stand out from the rest. The same values were visible on the relative item frequency plot.
rules_eclat<-ruleInduction(rules_e, bakery, confidence=0.05)
rules_eclat_i <- inspect(head(sort(rules_eclat, by = "confidence", decreasing = TRUE),15))
kable(rules_eclat_i, "html") %>% kable_styling("striped")
| | lhs | | rhs | support | confidence | lift | itemset |
|---|---|---|---|---|---|---|---|
| [1] | {Cake} | => | {Coffee} | 0.0565553 | 0.5389049 | 1.1178727 | 1 |
| [2] | {Tea} | => | {Coffee} | 0.0517163 | 0.3634431 | 0.7539051 | 2 |
| [3] | {Bread} | => | {Coffee} | 0.0898231 | 0.2767940 | 0.5741653 | 3 |
| [4] | {Coffee} | => | {Bread} | 0.0898231 | 0.1863237 | 0.5741653 | 3 |
| [5] | {Coffee} | => | {Cake} | 0.0565553 | 0.1173149 | 1.1178727 | 1 |
| [6] | {Coffee} | => | {Tea} | 0.0517163 | 0.1072773 | 0.7539051 | 2 |
To create rules I use ruleInduction() with confidence = 0.05. This gives 6 rules, but only 2 of them have lift above 1, and both come from the same itemset, {Cake, Coffee}.
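As a quick sanity check, the measures for {Cake} => {Coffee} can be recomputed by hand from the counts reported in the frequent itemset table above (694 baskets with cake, 3188 with coffee, 374 with both, 6613 baskets in total). The variable names are my own; this is only a sketch of the arithmetic.

```r
n_baskets    <- 6613   # total number of baskets
count_cake   <- 694    # baskets containing Cake
count_coffee <- 3188   # baskets containing Coffee
count_both   <- 374    # baskets containing both Cake and Coffee

support    <- count_both / n_baskets                  # share of baskets with both items
confidence <- count_both / count_cake                 # share of Cake baskets that also contain Coffee
lift       <- confidence / (count_coffee / n_baskets) # confidence relative to Coffee's overall support

round(c(support = support, confidence = confidence, lift = lift), 7)
```

The results agree with the first row of the rules table (support 0.0565553, confidence 0.5389049, lift 1.1178727).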
rules_eclat<-ruleInduction(eclat(bakery, parameter=list(supp=0.02)) , bakery, confidence=0.05)
rules_supp_i <- inspect(head(sort(rules_eclat, by="lift", decreasing=TRUE),15), linebreak=F)
kable(rules_supp_i, "html") %>% kable_styling("striped")
| | lhs | | rhs | support | confidence | lift | itemset |
|---|---|---|---|---|---|---|---|
| [1] | {Tea} | => | {Cake} | 0.0261606 | 0.1838470 | 1.751844 | 14 |
| [2] | {Cake} | => | {Tea} | 0.0261606 | 0.2492795 | 1.751844 | 14 |
| [3] | {Toast} | => | {Coffee} | 0.0257069 | 0.7296137 | 1.513468 | 1 |
| [4] | {Coffee} | => | {Toast} | 0.0257069 | 0.0533250 | 1.513468 | 1 |
| [5] | {NONE} | => | {Coffee} | 0.0417360 | 0.5810526 | 1.205302 | 8 |
| [6] | {Coffee} | => | {NONE} | 0.0417360 | 0.0865747 | 1.205302 | 8 |
| [7] | {Medialuna} | => | {Coffee} | 0.0329654 | 0.5751979 | 1.193157 | 6 |
| [8] | {Coffee} | => | {Medialuna} | 0.0329654 | 0.0683814 | 1.193157 | 6 |
| [9] | {Sandwich} | => | {Coffee} | 0.0423408 | 0.5679513 | 1.178125 | 9 |
| [10] | {Coffee} | => | {Sandwich} | 0.0423408 | 0.0878294 | 1.178125 | 9 |
| [11] | {Pastry} | => | {Coffee} | 0.0486920 | 0.5590278 | 1.159614 | 10 |
| [12] | {Coffee} | => | {Pastry} | 0.0486920 | 0.1010038 | 1.159614 | 10 |
| [13] | {Alfajores} | => | {Coffee} | 0.0223802 | 0.5522388 | 1.145532 | 2 |
| [14] | {Coffee} | => | {Cake} | 0.0565553 | 0.1173149 | 1.117873 | 12 |
| [15] | {Cake} | => | {Coffee} | 0.0565553 | 0.5389049 | 1.117873 | 12 |
rules_eclat
## set of 31 rules
Lowering the support threshold from 0.05 to 0.02 yields more rules with lift above 1, but lift is not the only measure that matters. I get 31 rules, many of them with very low confidence, so I decide to raise the confidence threshold.
rules_eclat_2<-ruleInduction(eclat(bakery, parameter=list(supp=0.02)) , bakery, confidence=0.3)
rules_supp_i_2 <- inspect(head(sort(rules_eclat_2, by="lift", decreasing=TRUE),15), linebreak=F)
kable(rules_supp_i_2, "html") %>% kable_styling("striped")
| | lhs | | rhs | support | confidence | lift | itemset |
|---|---|---|---|---|---|---|---|
| [1] | {Toast} | => | {Coffee} | 0.0257069 | 0.7296137 | 1.5134679 | 1 |
| [2] | {NONE} | => | {Coffee} | 0.0417360 | 0.5810526 | 1.2053015 | 8 |
| [3] | {Medialuna} | => | {Coffee} | 0.0329654 | 0.5751979 | 1.1931567 | 6 |
| [4] | {Sandwich} | => | {Coffee} | 0.0423408 | 0.5679513 | 1.1781249 | 9 |
| [5] | {Pastry} | => | {Coffee} | 0.0486920 | 0.5590278 | 1.1596144 | 10 |
| [6] | {Alfajores} | => | {Coffee} | 0.0223802 | 0.5522388 | 1.1455318 | 2 |
| [7] | {Cake} | => | {Coffee} | 0.0565553 | 0.5389049 | 1.1178727 | 12 |
| [8] | {Juice} | => | {Coffee} | 0.0213216 | 0.5300752 | 1.0995568 | 4 |
| [9] | {Cookies} | => | {Coffee} | 0.0297898 | 0.5267380 | 1.0926343 | 7 |
| [10] | {Hot chocolate} | => | {Coffee} | 0.0272191 | 0.5263158 | 1.0917586 | 5 |
| [11] | {Pastry} | => | {Bread} | 0.0296386 | 0.3402778 | 1.0485820 | 11 |
| [12] | {Brownie} | => | {Coffee} | 0.0208680 | 0.4758621 | 0.9871003 | 3 |
| [13] | {Tea} | => | {Coffee} | 0.0517163 | 0.3634431 | 0.7539051 | 15 |
rules_eclat_2
## set of 13 rules
After raising the confidence threshold to 0.3 I obtain 13 rules, 11 of which have lift above 1. Only one rule has bread as its consequent; for all the others it is coffee.
plot(rules_eclat_2, method="graph", shading="lift")
The rules presented as a graph.
plot(rules_eclat_2, method="graph", measure="support", shading="lift", engine="html")
The same plot as above, in an interactive version.
In this part I try the Apriori algorithm with the same support and confidence thresholds as for the Eclat algorithm.
rules_apriori<-apriori(bakery, parameter=list(supp=0.02, conf=0.3))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.02 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 132
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[103 item(s), 6613 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_apriori_i <- inspect(head(sort(rules_apriori, by = "lift", decreasing = TRUE),15))
kable(rules_apriori_i , "html") %>% kable_styling("striped")
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {Toast} | => | {Coffee} | 0.0257069 | 0.7296137 | 0.0352336 | 1.5134679 | 170 |
| [2] | {NONE} | => | {Coffee} | 0.0417360 | 0.5810526 | 0.0718282 | 1.2053015 | 276 |
| [3] | {Medialuna} | => | {Coffee} | 0.0329654 | 0.5751979 | 0.0573114 | 1.1931567 | 218 |
| [4] | {Sandwich} | => | {Coffee} | 0.0423408 | 0.5679513 | 0.0745501 | 1.1781249 | 280 |
| [5] | {Pastry} | => | {Coffee} | 0.0486920 | 0.5590278 | 0.0871012 | 1.1596144 | 322 |
| [6] | {Alfajores} | => | {Coffee} | 0.0223802 | 0.5522388 | 0.0405262 | 1.1455318 | 148 |
| [7] | {Cake} | => | {Coffee} | 0.0565553 | 0.5389049 | 0.1049448 | 1.1178727 | 374 |
| [8] | {Juice} | => | {Coffee} | 0.0213216 | 0.5300752 | 0.0402238 | 1.0995568 | 141 |
| [9] | {Cookies} | => | {Coffee} | 0.0297898 | 0.5267380 | 0.0565553 | 1.0926343 | 197 |
| [10] | {Hot chocolate} | => | {Coffee} | 0.0272191 | 0.5263158 | 0.0517163 | 1.0917586 | 180 |
| [11] | {Pastry} | => | {Bread} | 0.0296386 | 0.3402778 | 0.0871012 | 1.0485820 | 196 |
| [12] | {} | => | {Bread} | 0.3245123 | 0.3245123 | 1.0000000 | 1.0000000 | 2146 |
| [13] | {} | => | {Coffee} | 0.4820808 | 0.4820808 | 1.0000000 | 1.0000000 | 3188 |
| [14] | {Brownie} | => | {Coffee} | 0.0208680 | 0.4758621 | 0.0438530 | 0.9871003 | 138 |
| [15] | {Tea} | => | {Coffee} | 0.0517163 | 0.3634431 | 0.1422955 | 0.7539051 | 342 |
rules_apriori
## set of 15 rules
The results are nearly the same as with the Eclat algorithm. The only difference is the number of rules: Apriori returns 15 rules and Eclat 13. However, the 2 extra Apriori rules have an empty lhs (“{}”); they only restate the overall support of bread and coffee and carry no information about associations, so the conclusions from the two algorithms are the same. As mentioned in the previous part, Eclat is often considered more efficient and scalable than Apriori, so the choice of algorithm depends mainly on the dataset.
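Rules with an empty lhs can be left out of the Apriori result in two ways: by filtering them afterwards, or by asking for rules with at least 2 items via minlen = 2. The sketch below uses a hypothetical toy set of baskets instead of the bakery data, so the object names and thresholds are illustrative only.

```r
library(arules)

# Hypothetical toy baskets (not taken from the bakery data)
baskets <- list(
  c("Coffee", "Bread"),
  c("Coffee", "Cake"),
  c("Coffee"),
  c("Bread", "Pastry"),
  c("Coffee", "Toast")
)
trans <- as(baskets, "transactions")

# Default minlen = 1 also returns rules of the form {} => {item}
rules_all <- apriori(trans, parameter = list(supp = 0.2, conf = 0.3),
                     control = list(verbose = FALSE))

# Option 1: keep only rules whose lhs contains at least one item
rules_nonempty <- rules_all[size(lhs(rules_all)) > 0]

# Option 2: require at least 2 items per rule (lhs and rhs together)
rules_min2 <- apriori(trans, parameter = list(supp = 0.2, conf = 0.3, minlen = 2),
                      control = list(verbose = FALSE))
```

Both options should give the same set of rules, which makes the Apriori output directly comparable with the rules induced from the Eclat itemsets.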
plot(rules_apriori, method="graph", measure="support", shading="lift", engine="html")
Plot for the Apriori algorithm.
Sometimes it is necessary to analyze a particular product rather than the whole basket. In this part I focus on analyzing only one product.
rules_bread<-apriori(data=bakery, parameter=list(supp=0.01,conf = 0.1),
appearance=list(default="lhs",rhs="Bread"), control=list(verbose=F))
rules_bread<-sort(rules_bread, by="support", decreasing=T)
inspect(head(rules_bread))
## lhs rhs support confidence coverage lift count
## [1] {} => {Bread} 0.32451232 0.3245123 1.00000000 1.0000000 2146
## [2] {Coffee} => {Bread} 0.08982308 0.1863237 0.48208075 0.5741653 594
## [3] {Pastry} => {Bread} 0.02963859 0.3402778 0.08710116 1.0485820 196
## [4] {Tea} => {Bread} 0.02948737 0.2072264 0.14229548 0.6385778 195
## [5] {Cake} => {Bread} 0.02328746 0.2219020 0.10494481 0.6838015 154
## [6] {NONE} => {Bread} 0.01875095 0.2610526 0.07182822 0.8044460 124
Here I look only at rules whose consequent is bread. Among the printed rules, apart from the trivial rule with an empty lhs (lift exactly 1), only one has lift above 1; lift below 1 is not desirable, because it means the products are bought together less often than expected under independence. Even this one rule, {Pastry} => {Bread}, is not worth further research, because its support is only 0.03, its confidence is 0.34 and its lift is 1.05.
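Selecting the rules with lift above 1 does not have to be done by eye. A minimal sketch, assuming a hypothetical toy set of baskets rather than the bakery data, uses subset() on the rule quality measures:

```r
library(arules)

# Hypothetical toy baskets (not taken from the bakery data)
baskets <- list(
  c("Coffee", "Bread"), c("Coffee", "Cake"), c("Coffee", "Toast"),
  c("Bread", "Pastry"), c("Bread", "Pastry"), c("Coffee")
)
trans <- as(baskets, "transactions")

# Rules with Bread as the only allowed consequent; minlen = 2 skips {} => {Bread}
rules_bread_toy <- apriori(trans,
                           parameter = list(supp = 0.1, conf = 0.1, minlen = 2),
                           appearance = list(default = "lhs", rhs = "Bread"),
                           control = list(verbose = FALSE))

# Keep only rules where the lhs makes Bread more likely than on average
rules_bread_pos <- subset(rules_bread_toy, subset = lift > 1)
inspect(rules_bread_pos)
```

The same subset() call could be applied to rules_bread from the analysis above.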
rules_coffee<-apriori(data=bakery, parameter=list(supp=0.01,conf = 0.3, minlen=3),
appearance=list(default="lhs",rhs="Coffee"), control=list(verbose=F))
rules_coffee<-sort(rules_coffee, by="support", decreasing=T)
inspect(head(rules_coffee))
## lhs rhs support confidence coverage lift count
## [1] {Bread,Pastry} => {Coffee} 0.01134130 0.3826531 0.02963859 0.7937530 75
## [2] {Cake,Tea} => {Coffee} 0.01119008 0.4277457 0.02616059 0.8872905 74
It is also possible to look for rules with several products in the antecedent. Here I set minlen = 3, so each rule has 2 products on the lhs and coffee on the rhs. In this case the values are too low (support of about 0.01 and lift below 1) to take the resulting rules into account when creating a strategy for the company.
Association rules are very useful for setting a better product placement strategy, not only for huge companies but also for small businesses. In this project I have shown an example use of the Eclat and Apriori algorithms to determine association rules for a bakery. I have not focused on the results of my research, because this project is meant to interest the reader in association rules, not to explain them fully. If I have met my goal, here is one of many websites where you can learn more about association rules: https://towardsdatascience.com/association-rules-2-aa9a77241654