Introduction

Association rules are a method for exploring relationships between items in data. They are highly useful for companies, which can use them to examine consumer behavior from transaction data. Greater knowledge about consumers can in turn be used to increase profits. In this paper I look for association rules that are typical for bakery customers. The dataset, which contains transactions from a bakery, was downloaded from https://www.kaggle.com/sulmansarwar/transactions-from-a-bakery/version/1

Libraries

First, I load the necessary libraries.

library(arules)
library(arulesViz)
library(kableExtra)

The dataset has to be converted to transaction format before association rule techniques can be applied. Hence, I use read.transactions and specify two columns: “Transaction” and “Item”. Additionally, I check the number of baskets and the number of unique items.

bakery <- read.transactions("Bakery.csv", format="single", sep=",", cols=c("Transaction","Item"), header=TRUE)

summary(size(bakery))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.089   3.000  10.000

cat("Number of baskets:", length(bakery))

## Number of baskets: 6613

cat("Number of unique items:", length(itemLabels(bakery)))

## Number of unique items: 103

To become acquainted with the data, I plot the absolute and relative item frequencies.

itemFrequencyPlot(bakery, topN = 10, type = "absolute", main = "Item frequency", cex.names = 0.75)

itemFrequencyPlot(bakery, topN = 10, type = "relative", main = "Item frequency", cex.names = 0.75)

The most frequently ordered products are coffee and bread. This is interesting: because the dataset comes from a bakery, one would expect bread, not coffee, to be the most frequently ordered product.

Eclat algorithm

Eclat stands for Equivalence Class Clustering and bottom-up Lattice Traversal. The algorithm is used to identify frequent patterns in transaction data. It is often described as a more efficient and scalable alternative to the Apriori algorithm. While Apriori works on a horizontal data layout and explores itemsets level by level, similar to a breadth-first search of a graph, Eclat works on a vertical layout and explores itemsets in the manner of a depth-first search. I use the Apriori algorithm in a later part of this project to show the differences in results between the two algorithms. More about the Eclat algorithm: https://www.geeksforgeeks.org/ml-eclat-algorithm/
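The vertical layout can be illustrated with a minimal base-R sketch. The five toy baskets below are hypothetical and are not taken from the bakery data. Each item is mapped to the IDs of the transactions that contain it (a tid-list), and the support of an itemset is obtained by intersecting tid-lists. This only illustrates the idea; it is not how arules implements eclat().

```r
# Hypothetical toy baskets (not from the bakery dataset)
baskets <- list(
  c("Coffee", "Bread"),
  c("Coffee", "Cake"),
  c("Coffee"),
  c("Bread"),
  c("Coffee", "Bread", "Cake")
)

# Vertical layout: for each item, the IDs of the transactions that contain it
items <- sort(unique(unlist(baskets)))
tid_lists <- lapply(items, function(it) {
  which(vapply(baskets, function(b) it %in% b, logical(1)))
})
names(tid_lists) <- items

# Support of an itemset = size of the intersection of its tid-lists / number of baskets
itemset_support <- function(itemset) {
  tids <- Reduce(intersect, tid_lists[itemset])
  length(tids) / length(baskets)
}

itemset_support("Coffee")              # Coffee appears in 4 of 5 baskets
itemset_support(c("Coffee", "Bread"))  # both appear together in 2 of 5 baskets
```

Because a larger itemset is evaluated by intersecting tid-lists that are already in memory, the transactions do not have to be rescanned for every candidate.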

To make the rest of the analysis easier to follow, I recall three measures:

support - the frequency of an itemset or a rule in the data, i.e. the share of transactions that contain it.
confidence - the share of transactions containing the antecedent X that also contain the consequent Y: confidence(X => Y) = support(X and Y together) / support(X).
lift - the strength of the rule (the higher the lift, the stronger the association). It is calculated as: lift(X => Y) = confidence(X => Y) / support(Y)

For an association rule X => Y:

lift > 1 means that X and Y are positively correlated,
lift = 1 means that X and Y are independent,
lift < 1 means that X and Y are negatively correlated.
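As a worked example of these formulas, the sketch below computes confidence and lift for the rule {Cake} => {Coffee}. The three support values are copied from the eclat() output reported in this paper; the variable names are only illustrative.

```r
# Supports copied from the eclat() output in this report
supp_cake        <- 0.1049448   # support({Cake})
supp_coffee      <- 0.4820808   # support({Coffee})
supp_cake_coffee <- 0.0565553   # support({Cake, Coffee})

# confidence(Cake => Coffee) = support({Cake, Coffee}) / support({Cake})
conf_cake_coffee <- supp_cake_coffee / supp_cake

# lift(Cake => Coffee) = confidence(Cake => Coffee) / support({Coffee})
lift_cake_coffee <- conf_cake_coffee / supp_coffee

round(c(confidence = conf_cake_coffee, lift = lift_cake_coffee), 4)
```

The confidence is about 0.539 and the lift about 1.118, so cake buyers order coffee slightly more often than an average customer does.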

rules_e <-eclat(bakery, parameter=list(supp=0.05))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.05      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 330 
## 
## create itemset ... 
## set transactions ...[103 item(s), 6613 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating bit matrix ... [10 row(s), 6613 column(s)] done [0.00s].
## writing  ... [13 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

rules_e_i <- inspect(head(sort(rules_e, by = "support"), 15))

kable(rules_e_i, "html") %>% kable_styling("striped")

	items	support	transIdenticalToItemsets	count
[1]	{Coffee}	0.4820808	3188	3188
[2]	{Bread}	0.3245123	2146	2146
[3]	{Tea}	0.1422955	941	941
[4]	{Cake}	0.1049448	694	694
[5]	{Bread,Coffee}	0.0898231	594	594
[6]	{Pastry}	0.0871012	576	576
[7]	{Sandwich}	0.0745501	493	493
[8]	{NONE}	0.0718282	475	475
[9]	{Medialuna}	0.0573114	379	379
[10]	{Cake,Coffee}	0.0565553	374	374
[11]	{Cookies}	0.0565553	374	374
[12]	{Coffee,Tea}	0.0517163	342	342
[13]	{Hot chocolate}	0.0517163	342	342

Calling eclat() does not create rules. The algorithm only mines frequent itemsets, i.e. the itemsets whose support reaches the chosen threshold, which narrows down the data. I obtain the frequent itemsets together with the measures computed for them. Looking at support, we can see that coffee and bread stand far above the rest. The same values were visible in the relative item frequency plot.

rules_eclat<-ruleInduction(rules_e, bakery, confidence=0.05)
rules_eclat_i <- inspect(head(sort(rules_eclat, by = "confidence", decreasing = TRUE),15))

kable(rules_eclat_i, "html") %>% kable_styling("striped")

	lhs		rhs	support	confidence	lift	itemset
[1]	{Cake}	=>	{Coffee}	0.0565553	0.5389049	1.1178727	1
[2]	{Tea}	=>	{Coffee}	0.0517163	0.3634431	0.7539051	2
[3]	{Bread}	=>	{Coffee}	0.0898231	0.2767940	0.5741653	3
[4]	{Coffee}	=>	{Bread}	0.0898231	0.1863237	0.5741653	3
[5]	{Coffee}	=>	{Cake}	0.0565553	0.1173149	1.1178727	1
[6]	{Coffee}	=>	{Tea}	0.0517163	0.1072773	0.7539051	2

To create rules I use ruleInduction() with confidence = 0.05. This gives 6 rules, but only 2 of them have a lift higher than 1.
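What rule induction does for two-item sets can be sketched in base R: every frequent pair {a, b} yields the candidate rules a => b and b => a, and a rule is kept if its confidence reaches the threshold. The supports are copied from the eclat() table above; the helper induce_pair_rules() is hypothetical and is not the arules implementation.

```r
# Supports of single items and of frequent pairs, copied from the eclat() table above
item_supp <- c(Coffee = 0.4820808, Bread = 0.3245123, Tea = 0.1422955, Cake = 0.1049448)
pairs <- data.frame(
  a = c("Bread", "Cake", "Coffee"),
  b = c("Coffee", "Coffee", "Tea"),
  support = c(0.0898231, 0.0565553, 0.0517163),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn each pair {a, b} into the rules a => b and b => a
induce_pair_rules <- function(pairs, item_supp, min_conf) {
  rules <- rbind(
    data.frame(lhs = pairs$a, rhs = pairs$b, support = pairs$support, stringsAsFactors = FALSE),
    data.frame(lhs = pairs$b, rhs = pairs$a, support = pairs$support, stringsAsFactors = FALSE)
  )
  # confidence = support(pair) / support(lhs); lift = confidence / support(rhs)
  rules$confidence <- rules$support / unname(item_supp[rules$lhs])
  rules$lift <- rules$confidence / unname(item_supp[rules$rhs])
  # Keep rules that reach the confidence threshold, strongest first
  rules <- rules[rules$confidence >= min_conf, ]
  rules[order(-rules$confidence), ]
}

pair_rules <- induce_pair_rules(pairs, item_supp, min_conf = 0.05)
pair_rules
```

With min_conf = 0.05 all six candidate rules survive, and only the two rules that combine cake and coffee have a lift above 1, which agrees with the ruleInduction() output.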

rules_eclat<-ruleInduction(eclat(bakery, parameter=list(supp=0.02)) , bakery, confidence=0.05)
rules_supp_i <- inspect(head(sort(rules_eclat, by="lift", decreasing=TRUE),15), linebreak=F)

kable(rules_supp_i, "html") %>% kable_styling("striped")

	lhs		rhs	support	confidence	lift	itemset
[1]	{Tea}	=>	{Cake}	0.0261606	0.1838470	1.751844	14
[2]	{Cake}	=>	{Tea}	0.0261606	0.2492795	1.751844	14
[3]	{Toast}	=>	{Coffee}	0.0257069	0.7296137	1.513468	1
[4]	{Coffee}	=>	{Toast}	0.0257069	0.0533250	1.513468	1
[5]	{NONE}	=>	{Coffee}	0.0417360	0.5810526	1.205302	8
[6]	{Coffee}	=>	{NONE}	0.0417360	0.0865747	1.205302	8
[7]	{Medialuna}	=>	{Coffee}	0.0329654	0.5751979	1.193157	6
[8]	{Coffee}	=>	{Medialuna}	0.0329654	0.0683814	1.193157	6
[9]	{Sandwich}	=>	{Coffee}	0.0423408	0.5679513	1.178125	9
[10]	{Coffee}	=>	{Sandwich}	0.0423408	0.0878294	1.178125	9
[11]	{Pastry}	=>	{Coffee}	0.0486920	0.5590278	1.159614	10
[12]	{Coffee}	=>	{Pastry}	0.0486920	0.1010038	1.159614	10
[13]	{Alfajores}	=>	{Coffee}	0.0223802	0.5522388	1.145532	2
[14]	{Coffee}	=>	{Cake}	0.0565553	0.1173149	1.117873	12
[15]	{Cake}	=>	{Coffee}	0.0565553	0.5389049	1.117873	12

rules_eclat

## set of 31 rules

Lowering the support threshold from 0.05 to 0.02 yields more rules with a lift higher than 1, but lift is not the only thing that matters. I get 31 rules, many of them with very low confidence, so I decide to raise the confidence threshold.

rules_eclat_2<-ruleInduction(eclat(bakery, parameter=list(supp=0.02)) , bakery, confidence=0.3)
rules_supp_i_2 <- inspect(head(sort(rules_eclat_2, by="lift", decreasing=TRUE),15), linebreak=F)

kable(rules_supp_i_2, "html") %>% kable_styling("striped")

	lhs		rhs	support	confidence	lift	itemset
[1]	{Toast}	=>	{Coffee}	0.0257069	0.7296137	1.5134679	1
[2]	{NONE}	=>	{Coffee}	0.0417360	0.5810526	1.2053015	8
[3]	{Medialuna}	=>	{Coffee}	0.0329654	0.5751979	1.1931567	6
[4]	{Sandwich}	=>	{Coffee}	0.0423408	0.5679513	1.1781249	9
[5]	{Pastry}	=>	{Coffee}	0.0486920	0.5590278	1.1596144	10
[6]	{Alfajores}	=>	{Coffee}	0.0223802	0.5522388	1.1455318	2
[7]	{Cake}	=>	{Coffee}	0.0565553	0.5389049	1.1178727	12
[8]	{Juice}	=>	{Coffee}	0.0213216	0.5300752	1.0995568	4
[9]	{Cookies}	=>	{Coffee}	0.0297898	0.5267380	1.0926343	7
[10]	{Hot chocolate}	=>	{Coffee}	0.0272191	0.5263158	1.0917586	5
[11]	{Pastry}	=>	{Bread}	0.0296386	0.3402778	1.0485820	11
[12]	{Brownie}	=>	{Coffee}	0.0208680	0.4758621	0.9871003	3
[13]	{Tea}	=>	{Coffee}	0.0517163	0.3634431	0.7539051	15

rules_eclat_2

## set of 13 rules

After raising the confidence threshold to 0.3 I obtain 13 rules, and for most of them (11 of 13) the lift is above 1. Only one rule has bread as its consequent; for all the others the consequent is coffee.

plot(rules_eclat_2, method="graph", shading="lift")

The same results presented as a graph.

plot(rules_eclat_2, method="graph", measure="support", shading="lift", engine="html")

The same plot as above, in an interactive version.

Apriori algorithm

In this part I run the Apriori algorithm with the same support and confidence values as for Eclat.

rules_apriori<-apriori(bakery, parameter=list(supp=0.02, conf=0.3))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.02      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 132 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[103 item(s), 6613 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_apriori_i <-  inspect(head(sort(rules_apriori, by = "lift", decreasing = TRUE),15))

kable(rules_apriori_i , "html") %>% kable_styling("striped")

	lhs		rhs	support	confidence	coverage	lift	count
[1]	{Toast}	=>	{Coffee}	0.0257069	0.7296137	0.0352336	1.5134679	170
[2]	{NONE}	=>	{Coffee}	0.0417360	0.5810526	0.0718282	1.2053015	276
[3]	{Medialuna}	=>	{Coffee}	0.0329654	0.5751979	0.0573114	1.1931567	218
[4]	{Sandwich}	=>	{Coffee}	0.0423408	0.5679513	0.0745501	1.1781249	280
[5]	{Pastry}	=>	{Coffee}	0.0486920	0.5590278	0.0871012	1.1596144	322
[6]	{Alfajores}	=>	{Coffee}	0.0223802	0.5522388	0.0405262	1.1455318	148
[7]	{Cake}	=>	{Coffee}	0.0565553	0.5389049	0.1049448	1.1178727	374
[8]	{Juice}	=>	{Coffee}	0.0213216	0.5300752	0.0402238	1.0995568	141
[9]	{Cookies}	=>	{Coffee}	0.0297898	0.5267380	0.0565553	1.0926343	197
[10]	{Hot chocolate}	=>	{Coffee}	0.0272191	0.5263158	0.0517163	1.0917586	180
[11]	{Pastry}	=>	{Bread}	0.0296386	0.3402778	0.0871012	1.0485820	196
[12]	{}	=>	{Bread}	0.3245123	0.3245123	1.0000000	1.0000000	2146
[13]	{}	=>	{Coffee}	0.4820808	0.4820808	1.0000000	1.0000000	3188
[14]	{Brownie}	=>	{Coffee}	0.0208680	0.4758621	0.0438530	0.9871003	138
[15]	{Tea}	=>	{Coffee}	0.0517163	0.3634431	0.1422955	0.7539051	342

rules_apriori

## set of 15 rules

The results are nearly the same as those from the Eclat algorithm. The only difference is the number of rules: Apriori returns 15 rules and Eclat 13. The 2 additional Apriori rules have an empty antecedent (lhs equal to “{}”), so they only restate the overall support of bread and coffee and carry no information about associations. The conclusions from the two algorithms are therefore the same. As mentioned in the previous part, Eclat is often described as more efficient and scalable than Apriori, so the choice between them depends mainly on the dataset.
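Rules with an empty antecedent can be avoided either by requiring at least two items per rule (minlen = 2) or by filtering the result afterwards. The sketch below uses a small hypothetical set of baskets instead of the bakery data, so the numbers are purely illustrative.

```r
library(arules)

# Hypothetical toy baskets (not the bakery data)
toy <- as(list(
  c("Coffee", "Bread"),
  c("Coffee", "Cake"),
  c("Coffee"),
  c("Bread"),
  c("Coffee", "Bread", "Cake")
), "transactions")

# minlen = 1 also returns rules of the form {} => item
rules_all <- apriori(toy, parameter = list(supp = 0.2, conf = 0.3, minlen = 1),
                     control = list(verbose = FALSE))

# Option 1: require at least two items per rule (lhs and rhs counted together)
rules_min2 <- apriori(toy, parameter = list(supp = 0.2, conf = 0.3, minlen = 2),
                      control = list(verbose = FALSE))

# Option 2: keep only the rules whose antecedent is not empty
rules_filtered <- rules_all[size(lhs(rules_all)) > 0]

c(all = length(rules_all), minlen2 = length(rules_min2), filtered = length(rules_filtered))
```

Both options should leave the same set of rules, which makes the Apriori result directly comparable with the rules induced from the Eclat itemsets.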

plot(rules_apriori, method="graph", measure="support", shading="lift", engine="html")

The interactive graph for the Apriori rules.

Rules for a particular product

Sometimes it is necessary to analyze a particular product rather than the whole basket. In this part I focus on rules for a single product.

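# Only rules with Bread as the consequent; any item may appear in the antecedent
rules_bread <- apriori(data = bakery, parameter = list(supp = 0.01, conf = 0.1),
                       appearance = list(default = "lhs", rhs = "Bread"), control = list(verbose = FALSE))
# Sort by support and show the top rules
rules_bread <- sort(rules_bread, by = "support", decreasing = TRUE)
inspect(head(rules_bread))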

##     lhs         rhs     support    confidence coverage   lift      count
## [1] {}       => {Bread} 0.32451232 0.3245123  1.00000000 1.0000000 2146 
## [2] {Coffee} => {Bread} 0.08982308 0.1863237  0.48208075 0.5741653  594 
## [3] {Pastry} => {Bread} 0.02963859 0.3402778  0.08710116 1.0485820  196 
## [4] {Tea}    => {Bread} 0.02948737 0.2072264  0.14229548 0.6385778  195 
## [5] {Cake}   => {Bread} 0.02328746 0.2219020  0.10494481 0.6838015  154 
## [6] {NONE}   => {Bread} 0.01875095 0.2610526  0.07182822 0.8044460  124

Here I check only rules whose consequent is bread. Among the rules shown, only one, {Pastry} => {Bread}, has a lift above 1. The rule with an empty antecedent has a lift of exactly 1, and the remaining rules have a lift below 1, which is not desirable. Even this one rule is not worth further research, because its support is only about 0.03, its confidence is 0.34 and its lift is 1.05.

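# Only rules with Coffee as the consequent; minlen = 3 counts lhs and rhs together,
# so every rule has two products in the antecedent
rules_coffee <- apriori(data = bakery, parameter = list(supp = 0.01, conf = 0.3, minlen = 3),
                        appearance = list(default = "lhs", rhs = "Coffee"), control = list(verbose = FALSE))
rules_coffee <- sort(rules_coffee, by = "support", decreasing = TRUE)
inspect(head(rules_coffee))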

##     lhs               rhs      support    confidence coverage   lift      count
## [1] {Bread,Pastry} => {Coffee} 0.01134130 0.3826531  0.02963859 0.7937530 75   
## [2] {Cake,Tea}     => {Coffee} 0.01119008 0.4277457  0.02616059 0.8872905 74

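It is also possible to require an antecedent that consists of several products. With minlen = 3, each rule has two products on the left-hand side and coffee as the consequent. In this case both rules have a support of only about 0.01 and a lift below 1, so they are too weak to be taken into account when creating a strategy for the company.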

Conclusions

Association rules are very useful for improving product placement strategy, not only for huge companies but also for small businesses. In this project I have shown an example of using the Eclat and Apriori algorithms to determine association rules for a bakery. I have not focused on interpreting the results in depth, because the project is meant to interest the reader in association rules rather than to explain them fully. If I have met my goal, here is one of many websites where you can learn more about association rules: https://towardsdatascience.com/association-rules-2-aa9a77241654