To discover patterns in clients’ behavior, many companies, including small enterprises, study associations between purchased products. This helps to recognize which products are bought together more frequently than others. Such knowledge may contribute to a significant rise in profits simply through promotions or packages that sell particular products jointly. Even the place in the store where particular products are displayed can be significant in clients’ decision-making process. The field of Data Science that covers these activities is usually called Association Rule Mining, and the rules according to which clients decide to buy some products are called Association Rules.
In this paper, using the apriori and eclat algorithms, I will analyze association rules and discover patterns that occur in the examined dataset. To enrich the analysis, I will also consider rules for two frequently bought products: classic croissant and americano.
Our dataset comes from Kaggle and presents delivery data from a bakery in Korea (https://www.kaggle.com/hosubjeong/bakery-sales). The data were gathered from 11 July 2019 to 18 June 2020. Because some observations are missing, we work with 2420 observations.
library(arules)      # read.transactions(), eclat(), apriori(), inspect()
library(arulesViz)   # plot() methods for association rules
bakery<-read.transactions("bakerysales1.csv", format="basket", sep=";", skip=0)
bakery=bakery[1:2420]
inspect(head(bakery))
## items
## [1] {americano,angbutter,tiramisu.croissant,vanila.latte}
## [2] {angbutter,orange.pound,tiramisu.croissant}
## [3] {tiramisu.croissant}
## [4] {angbutter,plain.bread,vanila.latte}
## [5] {angbutter,tiramisu.croissant}
## [6] {angbutter,milk.tea,vanila.latte}
Now let’s examine the most frequently bought products among the available ones.
itemFrequency(bakery,type="absolute")
## almond.croissant americano angbutter berry.ade
## 202 412 1973 54
## cacao.deep caffe.latte cheese.cake croissant
## 323 193 90 747
## gateau.chocolat jam lemonade merinque.cookies
## 196 220 35 47
## milk.tea orange.pound pain.au.chocolat pandoro
## 137 519 587 343
## plain.bread tiramisu tiramisu.croissant vanila.latte
## 857 7 779 209
## wiener
## 355
itemFrequencyPlot(bakery, topN=12, type="relative", main="Item Frequency", col="purple")
itemFrequencyPlot(bakery, topN=12, type="absolute", main="Item Frequency", col="purple")
As we can see from both plots, angbutter (a pretzel filled with red beans and gourmet butter) was bought in the largest number of orders. Next come plain bread and croissants (classic and tiramisu). Among beverages, americano coffee was the most popular: it was ordered 412 times.
In order to assess the strength of a rule, we can use the Support and Confidence measures. Support tells us how often a particular rule is applicable in a given dataset. Confidence informs us about the reliability of the inference made by the rule; we can also look at confidence as the conditional probability of B given A (https://www-users.cs.umn.edu/~kumar001/dmbook/ch6.pdf). In simpler words, Support is the ratio of the number of orders in which A and B were ordered together to the total number of orders. Confidence is the probability of buying B given that we already have A in our basket.
Two other commonly used metrics are Expected confidence and Lift. Expected confidence is the probability of the consequent occurring on its own, that is, the confidence we would expect if the antecedent and the consequent were independent. Lift is the ratio of confidence to expected confidence. (https://pub.towardsai.net/association-discovery-the-apriori-algorithm-28c1e71e0f04)
Our next step is to find the frequently bought baskets (itemsets). To do this, we will use Eclat, an algorithm that digs into a dataset and finds the most frequent itemsets. It does not create rules. Together with the itemsets, we obtain a measure (usually support) for each itemset. I will set the minimum support to 0.15.
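As a worked illustration of these four measures, the short base R sketch below recomputes them by hand for the rule {croissant} => {plain.bread}. The counts are the ones reported in the outputs of this paper (747 and 857 from the item frequencies, 275 from the count column of the apriori output); the variable names are only illustrative.

```r
# Counts taken from the outputs in this paper
n    <- 2420  # total number of orders
n_a  <- 747   # orders containing croissant (antecedent A)
n_b  <- 857   # orders containing plain.bread (consequent B)
n_ab <- 275   # orders containing both A and B

support             <- n_ab / n    # share of all orders with A and B
confidence          <- n_ab / n_a  # P(B | A)
expected_confidence <- n_b / n     # P(B), ignoring A
lift                <- confidence / expected_confidence

# Rounded: support 0.1136, confidence 0.3681,
# expected confidence 0.3541, lift 1.0396
round(c(support = support, confidence = confidence,
        expected_confidence = expected_confidence, lift = lift), 4)
```

The rounded values agree with the support, confidence and lift that apriori reports for this rule.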
freq.items<-eclat(bakery, parameter=list(supp=0.15, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.15 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 363
##
## create itemset ...
## set transactions ...[21 item(s), 2420 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating bit matrix ... [7 row(s), 2420 column(s)] done [0.00s].
## writing ... [12 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(freq.items)
## items support transIdenticalToItemsets count
## [1] {angbutter,orange.pound} 0.1677686 406 406
## [2] {angbutter,pain.au.chocolat} 0.1818182 440 440
## [3] {angbutter,tiramisu.croissant} 0.2483471 601 601
## [4] {angbutter,croissant} 0.2305785 558 558
## [5] {angbutter,plain.bread} 0.2677686 648 648
## [6] {angbutter} 0.8152893 1973 1973
## [7] {plain.bread} 0.3541322 857 857
## [8] {croissant} 0.3086777 747 747
## [9] {tiramisu.croissant} 0.3219008 779 779
## [10] {pain.au.chocolat} 0.2425620 587 587
## [11] {orange.pound} 0.2144628 519 519
## [12] {americano} 0.1702479 412 412
According to the results, angbutter, plain bread, and tiramisu croissant are the most frequently ordered products from this bakery. Among the itemsets with more than one product, angbutter is present in every one.
Apriori has the same aim as Eclat: it looks for the most frequent itemsets in the database. Additionally, it creates association rules from the itemsets. These rules describe relations between items.
Now let’s create these rules, taking all of the itemsets into consideration. This algorithm also requires us to choose threshold values for the function. After examining the dataset, I chose a minimum support of 0.1 and a minimum confidence of 0.2.
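To make concrete how Eclat arrives at supports like the ones listed above, here is a minimal base R sketch of its core idea: store, for every item, the ids of the transactions that contain it, and intersect those lists to count an itemset. The baskets are hypothetical and itemset_support is an illustrative helper name; the eclat() implementation in arules is far more optimized (its log mentions a bit matrix), so this is not the package's code.

```r
# Hypothetical toy baskets (not taken from the bakery data)
baskets <- list(
  c("angbutter", "croissant"),
  c("angbutter", "plain.bread"),
  c("angbutter", "croissant", "plain.bread"),
  c("croissant"),
  c("angbutter")
)

# Vertical layout: for every item, the ids of the baskets containing it
items <- sort(unique(unlist(baskets)))
tidlists <- lapply(items, function(it) {
  which(sapply(baskets, function(b) it %in% b))
})
names(tidlists) <- items

# Support of an itemset = size of the intersection of its id lists / n
itemset_support <- function(itemset) {
  tids <- Reduce(intersect, tidlists[itemset])
  length(tids) / length(baskets)
}

itemset_support("angbutter")                  # 4 of 5 baskets
itemset_support(c("angbutter", "croissant"))  # 2 of 5 baskets
```

An itemset would be reported as frequent when this value reaches the chosen minimum support.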
rb<-apriori(bakery, parameter=list(supp=0.1, conf=0.2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 242
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[21 item(s), 2420 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [24 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(rb)
## lhs rhs support confidence
## [1] {} => {orange.pound} 0.2144628 0.2144628
## [2] {} => {pain.au.chocolat} 0.2425620 0.2425620
## [3] {} => {tiramisu.croissant} 0.3219008 0.3219008
## [4] {} => {croissant} 0.3086777 0.3086777
## [5] {} => {plain.bread} 0.3541322 0.3541322
## [6] {} => {angbutter} 0.8152893 0.8152893
## [7] {cacao.deep} => {angbutter} 0.1012397 0.7585139
## [8] {pandoro} => {angbutter} 0.1090909 0.7696793
## [9] {wiener} => {angbutter} 0.1103306 0.7521127
## [10] {americano} => {angbutter} 0.1347107 0.7912621
## [11] {orange.pound} => {angbutter} 0.1677686 0.7822736
## [12] {angbutter} => {orange.pound} 0.1677686 0.2057780
## [13] {pain.au.chocolat} => {angbutter} 0.1818182 0.7495741
## [14] {angbutter} => {pain.au.chocolat} 0.1818182 0.2230106
## [15] {tiramisu.croissant} => {plain.bread} 0.1016529 0.3157895
## [16] {plain.bread} => {tiramisu.croissant} 0.1016529 0.2870478
## [17] {tiramisu.croissant} => {angbutter} 0.2483471 0.7715019
## [18] {angbutter} => {tiramisu.croissant} 0.2483471 0.3046123
## [19] {croissant} => {plain.bread} 0.1136364 0.3681392
## [20] {plain.bread} => {croissant} 0.1136364 0.3208868
## [21] {croissant} => {angbutter} 0.2305785 0.7469880
## [22] {angbutter} => {croissant} 0.2305785 0.2828180
## [23] {plain.bread} => {angbutter} 0.2677686 0.7561260
## [24] {angbutter} => {plain.bread} 0.2677686 0.3284339
## coverage lift count
## [1] 1.0000000 1.0000000 519
## [2] 1.0000000 1.0000000 587
## [3] 1.0000000 1.0000000 779
## [4] 1.0000000 1.0000000 747
## [5] 1.0000000 1.0000000 857
## [6] 1.0000000 1.0000000 1973
## [7] 0.1334711 0.9303617 245
## [8] 0.1417355 0.9440567 264
## [9] 0.1466942 0.9225102 267
## [10] 0.1702479 0.9705293 326
## [11] 0.2144628 0.9595044 406
## [12] 0.8152893 0.9595044 406
## [13] 0.2425620 0.9193965 440
## [14] 0.8152893 0.9193965 440
## [15] 0.3219008 0.8917276 246
## [16] 0.3541322 0.8917276 246
## [17] 0.3219008 0.9462923 601
## [18] 0.8152893 0.9462923 601
## [19] 0.3086777 1.0395530 275
## [20] 0.3541322 1.0395530 275
## [21] 0.3086777 0.9162245 558
## [22] 0.8152893 0.9162245 558
## [23] 0.3541322 0.9274328 648
## [24] 0.8152893 0.9274328 648
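The first six rules in the listing have an empty antecedent, so they only restate the support of single items. A minimal sketch of how such rules could be left out is given below. It assumes the minlen parameter of apriori (shown with its default of 1 in the parameter specification above) and uses hypothetical toy baskets rather than the bakery file, so that it runs on its own.

```r
library(arules)

# Hypothetical toy baskets, used only so the sketch is self-contained
toy <- as(list(
  c("angbutter", "croissant"),
  c("angbutter", "plain.bread"),
  c("angbutter", "croissant", "plain.bread"),
  c("croissant", "plain.bread"),
  c("angbutter")
), "transactions")

# minlen = 2 requires at least two items per rule (LHS and RHS together),
# so rules of the form {} => {item} are not generated
rules <- apriori(toy,
                 parameter = list(supp = 0.2, conf = 0.2, minlen = 2),
                 control = list(verbose = FALSE))
inspect(sort(rules, by = "lift"))
```

On the bakery data the same idea would read apriori(bakery, parameter=list(supp=0.1, conf=0.2, minlen=2)).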
Sorted by the different measures, the rules look as follows.
rc<-sort(rb, by="confidence", decreasing=TRUE)
inspect(head(rc))
## lhs rhs support confidence coverage
## [1] {} => {angbutter} 0.8152893 0.8152893 1.0000000
## [2] {americano} => {angbutter} 0.1347107 0.7912621 0.1702479
## [3] {orange.pound} => {angbutter} 0.1677686 0.7822736 0.2144628
## [4] {tiramisu.croissant} => {angbutter} 0.2483471 0.7715019 0.3219008
## [5] {pandoro} => {angbutter} 0.1090909 0.7696793 0.1417355
## [6] {cacao.deep} => {angbutter} 0.1012397 0.7585139 0.1334711
## lift count
## [1] 1.0000000 1973
## [2] 0.9705293 326
## [3] 0.9595044 406
## [4] 0.9462923 601
## [5] 0.9440567 264
## [6] 0.9303617 245
rl<-sort(rb, by="lift", decreasing=TRUE)
inspect(head(rl))
## lhs rhs support confidence coverage
## [1] {croissant} => {plain.bread} 0.1136364 0.3681392 0.3086777
## [2] {plain.bread} => {croissant} 0.1136364 0.3208868 0.3541322
## [3] {} => {orange.pound} 0.2144628 0.2144628 1.0000000
## [4] {} => {pain.au.chocolat} 0.2425620 0.2425620 1.0000000
## [5] {} => {tiramisu.croissant} 0.3219008 0.3219008 1.0000000
## [6] {} => {croissant} 0.3086777 0.3086777 1.0000000
## lift count
## [1] 1.039553 275
## [2] 1.039553 275
## [3] 1.000000 519
## [4] 1.000000 587
## [5] 1.000000 779
## [6] 1.000000 747
plot(rb, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{}" "{croissant}" "{americano}"
## [4] "{orange.pound}" "{plain.bread}" "{pandoro}"
## [7] "{angbutter}" "{cacao.deep}" "{wiener}"
## [10] "{pain.au.chocolat}" "{tiramisu.croissant}"
## Itemsets in Consequent (RHS)
## [1] "{angbutter}" "{tiramisu.croissant}" "{pain.au.chocolat}"
## [4] "{plain.bread}" "{orange.pound}" "{croissant}"
rs<-sort(rb, by="support", decreasing=TRUE)
inspect(head(rs))
## lhs rhs support confidence coverage
## [1] {} => {angbutter} 0.8152893 0.8152893 1.0000000
## [2] {} => {plain.bread} 0.3541322 0.3541322 1.0000000
## [3] {} => {tiramisu.croissant} 0.3219008 0.3219008 1.0000000
## [4] {} => {croissant} 0.3086777 0.3086777 1.0000000
## [5] {plain.bread} => {angbutter} 0.2677686 0.7561260 0.3541322
## [6] {angbutter} => {plain.bread} 0.2677686 0.3284339 0.8152893
## lift count
## [1] 1.0000000 1973
## [2] 1.0000000 857
## [3] 1.0000000 779
## [4] 1.0000000 747
## [5] 0.9274328 648
## [6] 0.9274328 648
Rules may also be shown as plots. Some of them are presented below.
plot(rb, method="grouped")
plot(rb, method="graph")  # control=list(type="items") dropped: "type" is not a recognized control parameter
plot(rb, method="paracoord", control=list(reorder=TRUE))
As expected, the items that occur most often in our dataset have the highest values of the measures. By confidence, the most significant item is angbutter, which is the consequent of all six top rules.
By support, the highest values belong to rules with an empty antecedent, i.e. to the most-bought single products. Angbutter again appears in the two top-ranked rules that contain more than one product.
By lift, the two strongest relations are between plain bread and croissant. That both share one value is expected, since lift is a symmetric measure (the same value for X given Y and Y given X). The lift of 1.04 means that these two products are 1.04 times more likely to be purchased together than they would be if they were unrelated (class materials from Unsupervised Learning).
To dig deeper into the rules, I will examine those for two particular products: classic croissant and americano. The first is one of the most purchased products, and americano is the most popular beverage. We should not be surprised, since sweet snacks are often sold in a package with a hot beverage.
The measure by which I will assess the strength of a rule is confidence.
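The reading of lift as "observed versus expected under independence" can be checked by hand. The base R sketch below uses the counts reported above for croissant and plain bread; the variable names are illustrative only.

```r
n    <- 2420  # total orders
n_x  <- 747   # orders with croissant
n_y  <- 857   # orders with plain.bread
n_xy <- 275   # orders with both (count column of the rule)

# Number of orders expected to contain both items if the two
# purchases were independent of each other
expected_xy <- n_x * n_y / n

# Lift as observed / expected; the formula treats x and y alike,
# which is why both rule directions share one lift value
lift <- n_xy / expected_xy

round(expected_xy, 1)  # about 264.5 expected vs. 275 observed
round(lift, 4)         # 1.0396
```

A value this close to 1 points to only a weak positive association between the two products.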
rc<-apriori(data=bakery, parameter=list(supp=0.001,conf = 0.2),
  appearance=list(default="lhs", rhs="croissant"), control=list(verbose=FALSE))
rcb<-sort(rc, by="confidence", decreasing=TRUE)
inspect(head(rcb))
## lhs rhs support confidence coverage lift count
## [1] {orange.pound,
## pandoro,
## plain.bread,
## wiener} => {croissant} 0.001239669 1.00 0.001239669 3.239625 3
## [2] {angbutter,
## orange.pound,
## pandoro,
## plain.bread,
## wiener} => {croissant} 0.001239669 1.00 0.001239669 3.239625 3
## [3] {almond.croissant,
## angbutter,
## jam,
## pain.au.chocolat} => {croissant} 0.001652893 0.80 0.002066116 2.591700 4
## [4] {almond.croissant,
## angbutter,
## pandoro,
## plain.bread} => {croissant} 0.001652893 0.80 0.002066116 2.591700 4
## [5] {almond.croissant,
## angbutter,
## pain.au.chocolat,
## tiramisu.croissant} => {croissant} 0.001652893 0.80 0.002066116 2.591700 4
## [6] {almond.croissant,
## cheese.cake,
## pain.au.chocolat} => {croissant} 0.001239669 0.75 0.001652893 2.429719 3
rc<-apriori(data=bakery, parameter=list(supp=0.001,conf = 0.2),
  appearance=list(default="rhs", lhs="croissant"), control=list(verbose=FALSE))
rcb<-sort(rc, by="confidence", decreasing=TRUE)
inspect(head(rcb))
## lhs rhs support confidence coverage
## [1] {} => {angbutter} 0.81528926 0.8152893 1.0000000
## [2] {croissant} => {angbutter} 0.23057851 0.7469880 0.3086777
## [3] {croissant} => {plain.bread} 0.11363636 0.3681392 0.3086777
## [4] {} => {plain.bread} 0.35413223 0.3541322 1.0000000
## [5] {} => {tiramisu.croissant} 0.32190083 0.3219008 1.0000000
## [6] {croissant} => {pain.au.chocolat} 0.09586777 0.3105756 0.3086777
## lift count
## [1] 1.0000000 1973
## [2] 0.9162245 558
## [3] 1.0395530 275
## [4] 1.0000000 857
## [5] 1.0000000 779
## [6] 1.2803970 232
As we see, the rules with croissant as the consequent come from much more complex orders, in which clients also purchased other types of croissant, such as almond or tiramisu. The highest confidence (1.00) was reached for the basket with orange.pound, pandoro, plain.bread and wiener. Note that these rules rest on only three or four orders each (see the count column), so they should be read with caution.
Taking croissant as the antecedent, the strongest relations are with the most popular items (angbutter and plain bread), and also with pain.au.chocolat, which has the highest lift of the three (1.28).
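Since the top croissant rules are backed by only three or four orders, one may want to look only at rules seen in more transactions. The following is a sketch under assumptions: it runs on hypothetical toy baskets, and the cut-off of 2 is chosen purely for illustration. It relies on subset() from arules being able to filter on the count quality column that inspect() prints above.

```r
library(arules)

# Hypothetical toy baskets so that the sketch runs on its own
toy <- as(list(
  c("a", "b"), c("a", "b"), c("a", "b", "c"),
  c("a", "c"), c("b"), c("a")
), "transactions")

rules <- apriori(toy,
                 parameter = list(supp = 0.1, conf = 0.2, minlen = 2),
                 control = list(verbose = FALSE))

# Keep only rules observed in at least 2 transactions
# (2 is an illustrative threshold, not a recommendation)
reliable <- subset(rules, subset = count >= 2)
inspect(sort(reliable, by = "confidence"))
```

For the bakery rules a much higher cut-off than 2 would be needed; which value is sensible is a judgment call that the analysis above does not settle.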
americano<-apriori(data=bakery, parameter=list(supp=0.001,conf = 0.2),
  appearance=list(default="lhs", rhs="americano"), control=list(verbose=FALSE))
ra<-sort(americano, by="confidence", decreasing=TRUE)
inspect(head(ra))
## lhs rhs support confidence coverage lift count
## [1] {berry.ade,
## caffe.latte,
## tiramisu.croissant} => {americano} 0.001239669 1.0000000 0.001239669 5.873786 3
## [2] {angbutter,
## berry.ade,
## caffe.latte,
## tiramisu.croissant} => {americano} 0.001239669 1.0000000 0.001239669 5.873786 3
## [3] {angbutter,
## berry.ade,
## caffe.latte} => {americano} 0.002066116 0.8333333 0.002479339 4.894822 5
## [4] {berry.ade,
## caffe.latte} => {americano} 0.002479339 0.6666667 0.003719008 3.915858 6
## [5] {lemonade,
## orange.pound} => {americano} 0.001239669 0.6000000 0.002066116 3.524272 3
## [6] {berry.ade,
## cacao.deep} => {americano} 0.001239669 0.6000000 0.002066116 3.524272 3
americano1<-apriori(data=bakery, parameter=list(supp=0.001,conf = 0.2),
  appearance=list(default="rhs", lhs="americano"), control=list(verbose=FALSE))
ra1<-sort(americano1, by="confidence", decreasing=TRUE)
inspect(head(ra1))
## lhs rhs support confidence coverage
## [1] {} => {angbutter} 0.81528926 0.8152893 1.0000000
## [2] {americano} => {angbutter} 0.13471074 0.7912621 0.1702479
## [3] {} => {plain.bread} 0.35413223 0.3541322 1.0000000
## [4] {} => {tiramisu.croissant} 0.32190083 0.3219008 1.0000000
## [5] {} => {croissant} 0.30867769 0.3086777 1.0000000
## [6] {americano} => {plain.bread} 0.04710744 0.2766990 0.1702479
## lift count
## [1] 1.0000000 1973
## [2] 0.9705293 326
## [3] 1.0000000 857
## [4] 1.0000000 779
## [5] 1.0000000 747
## [6] 0.7813438 114
With americano as the consequent, the strongest rules involve other drinks, especially berry ade and caffe latte. As for sweet snacks, tiramisu croissant and angbutter are the ones after which clients ordered americano.
When americano is the antecedent, the snack bought with it most frequently was angbutter. Another product bought along with americano was plain bread.
# Conclusions
In this short paper I examined two association rule algorithms, eclat and apriori, on a dataset of bakery sales. The analysis pointed out the products (items) that occurred most often and that, given their values on the most commonly used measures, could explain the majority of behavior patterns in the orders. Aligning the bakery’s offer with these results may contribute to higher revenue from delivery orders.
https://www.kaggle.com/hosubjeong/bakery-sales
https://pub.towardsai.net/association-discovery-the-apriori-algorithm-28c1e71e0f04
https://select-statistics.co.uk/blog/market-basket-analysis-understanding-customer-behaviour/
Class Materials on Unsupervised Learning, Faculty of Economic Sciences, University of Warsaw