The key to every successful business is understanding its customers. To increase sales and profit, a company has to discover what its clients like, how much they are able to spend on the products on offer, and which product categories they are willing to buy together in a single transaction.
[Image; source: londonist.com]
Association rules are very helpful for mining patterns in customer behavior. They are if-then statements that describe relationships between data items. For example, if a customer buys a scarf, they will probably also look for gloves. In this example the scarf is the rule antecedent and the gloves are the rule consequent.
The aim of this paper is to mine association rules in a dataset of bakery transactions and to check whether the rules found in morning transactions differ from those found in afternoon transactions.
The dataset used in this study consists of 9192 transactions from “The Bread Basket” bakery located in Edinburgh. It is available on Kaggle (https://www.kaggle.com/mittalvasu95/the-bread-basket). The dataset has been split into two separate sets: morning transactions (4103) and afternoon transactions (5089).
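The source does not show how the split was made. The following is only a minimal base R sketch of one way to do it, assuming a long-format table with one row per purchased item. The toy data and the column names `Transaction`, `Item` and `period_day` are assumptions made for illustration, not taken from the analysis.

```r
# Toy long-format table: one row per purchased item.
# Data and column names are hypothetical, for illustration only.
purchases <- data.frame(
  Transaction = c(1, 1, 2, 3, 3, 3, 4),
  Item        = c("Coffee", "Bread", "Tea", "Coffee", "Cake", "Sandwich", "Bread"),
  period_day  = c("morning", "morning", "morning",
                  "afternoon", "afternoon", "afternoon", "afternoon"),
  stringsAsFactors = FALSE
)

# Split the rows by time of day, then group the items of each transaction.
by_period <- split(purchases, purchases$period_day)
baskets   <- lapply(by_period, function(d) split(d$Item, d$Transaction))

# Number of transactions per period.
lengths(baskets)
```

With the arules package loaded, a list of baskets of this shape can be coerced to a transactions object with `as(baskets$morning, "transactions")`.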
summary(df_morning)
## transactions as itemMatrix in sparse format with
## 4103 rows (elements/itemsets/transactions) and
## 76 columns (items) and a density of 0.02468348
##
## most frequent items:
## Coffee Bread Pastry Tea Medialuna (Other)
## 2113 1490 572 441 380 2701
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 9
## 1836 1345 629 214 55 17 6 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.876 2.000 9.000
##
## includes extended item information - examples:
## labels
## 1 Afternoon with the baker
## 2 Alfajores
## 3 Argentina Night
As expected, the items bought most often in the morning are coffee, bread and pastry. Over 75% of the transactions (3181 of 4103) consist of only one or two items.
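The share of small baskets can be checked directly from the size distribution printed in the summary. A short sketch, with the counts copied by hand from that output:

```r
# Transaction-size distribution copied from the summary(df_morning) output.
sizes  <- c(1, 2, 3, 4, 5, 6, 7, 9)
counts <- c(1836, 1345, 629, 214, 55, 17, 6, 1)

n_transactions <- sum(counts)

# Cumulative share of transactions with at most k items.
cum_share <- cumsum(counts) / n_transactions
names(cum_share) <- sizes
round(cum_share, 3)

# Mean basket size, for comparison with the mean reported in the summary.
mean_size <- sum(sizes * counts) / n_transactions
mean_size
```

The cumulative share at size 2 is the figure quoted in the text, and the mean should agree with the 1.876 shown in the summary.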
itemFrequencyPlot(df_morning, topN=15, type="relative", main="Morning transactions - Item frequency", col="#56B4E9")
Coffee appears in more than half of the transactions made in the morning.
summary(df_afternoon)
## transactions as itemMatrix in sparse format with
## 5089 rows (elements/itemsets/transactions) and
## 86 columns (items) and a density of 0.02441883
##
## most frequent items:
## Coffee Bread Tea Cake Sandwich (Other)
## 2340 1556 864 696 590 4641
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9
## 1965 1644 816 429 171 45 11 4 4
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 1.0 2.0 2.1 3.0 9.0
##
## includes extended item information - examples:
## labels
## 1 Afternoon with the baker
## 2 Alfajores
## 3 Argentina Night
In the afternoon coffee and bread are still the most popular items. Third place goes to tea, and pastry is not even in the top five.
itemFrequencyPlot(df_afternoon, topN=15, type="relative", main="Afternoon transactions - Item frequency", col="#009E73")
Coffee appears in more than 45% of the transactions made in the afternoon.
There are three main measures used in association rule mining:
Support
\(\mathrm{support}(X) = \frac{\mathrm{count}(X)}{N}\)
It shows how frequently an itemset or rule occurs in the dataset: \(\mathrm{count}(X)\) is the number of transactions that contain X and N is the total number of transactions.
Confidence
\(\mathrm{confidence}(X \to Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}\)
It shows the share of transactions containing X that also contain Y. Here \(\mathrm{support}(X \cup Y)\) is the support of the combined itemset, i.e. of the transactions that contain both X and Y.
Lift
\(\mathrm{lift}(X \to Y) = \frac{\mathrm{confidence}(X \to Y)}{\mathrm{support}(Y)}\)
It compares the probability of finding Y in the cart when X is known to be present with the probability of finding Y in the cart without any knowledge about X. If lift is greater than 1, there is a positive association between the two items or itemsets. If it is close to 1, they are independent. A value lower than 1 indicates a negative association.
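The three measures can be computed by hand on a small example. The sketch below uses made-up baskets (not the bakery data) and hypothetical helper functions `support()`, `confidence()` and `lift()` written in base R; they are not functions of the arules package.

```r
# Toy baskets, made up for illustration (not taken from the bakery data).
baskets <- list(
  c("Coffee", "Toast"),
  c("Coffee", "Bread"),
  c("Coffee", "Toast", "Juice"),
  c("Bread"),
  c("Tea", "Cake")
)

# Share of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Share of baskets containing x that also contain y.
confidence <- function(x, y, baskets) {
  support(c(x, y), baskets) / support(x, baskets)
}

# Confidence relative to the baseline frequency of y.
lift <- function(x, y, baskets) {
  confidence(x, y, baskets) / support(y, baskets)
}

support("Coffee", baskets)              # 3 of 5 baskets
confidence("Toast", "Coffee", baskets)  # both Toast baskets contain Coffee
lift("Toast", "Coffee", baskets)        # confidence divided by support of Coffee
```

In this toy example every basket with toast also contains coffee, so the confidence is 1 and the lift is 1 divided by the support of coffee.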
The Apriori algorithm identifies the frequent individual items in the database and extends them to larger and larger itemsets as long as those itemsets appear sufficiently often in the database.
First, the Apriori algorithm was run with its default thresholds (minimum support = 0.1, minimum confidence = 0.8).
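The level-wise idea can be sketched in a few lines of base R. This is a simplified illustration on made-up baskets, not the implementation used by the arules package: it grows each frequent itemset by one item and keeps the candidates that are still frequent, but it omits the subset-pruning step of the full algorithm. The names `item_count`, `min_count` and `all_frequent` are hypothetical.

```r
# Toy baskets, made up for illustration.
baskets <- list(
  c("Coffee", "Toast"),
  c("Coffee", "Bread"),
  c("Coffee", "Toast", "Juice"),
  c("Bread", "Coffee"),
  c("Tea")
)
min_count <- 2  # an itemset is frequent if at least 2 baskets contain it

# Number of baskets that contain every item in `items`.
item_count <- function(items) {
  sum(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Level 1: frequent single items.
items        <- sort(unique(unlist(baskets)))
frequent     <- Filter(function(s) item_count(s) >= min_count, as.list(items))
all_frequent <- frequent

# Level k + 1: extend each frequent k-itemset by one frequent item that
# sorts after its last item, then keep the candidates that are still frequent.
while (length(frequent) > 0) {
  frequent_items <- sort(unique(unlist(frequent)))
  candidates <- list()
  for (s in frequent) {
    for (i in frequent_items[frequent_items > s[length(s)]]) {
      candidates[[length(candidates) + 1]] <- c(s, i)
    }
  }
  frequent     <- Filter(function(s) item_count(s) >= min_count, candidates)
  all_frequent <- c(all_frequent, frequent)
}

# All frequent itemsets, one string per itemset.
vapply(all_frequent, paste, character(1), collapse = ", ")
```

The loop stops as soon as a level produces no frequent itemsets, because no larger itemset can be frequent if none of its subsets is.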
rules_morning<-apriori(df_morning)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 410
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[76 item(s), 4103 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_afternoon<-apriori(df_afternoon)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 508
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[86 item(s), 5089 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Unfortunately, with these defaults the algorithm did not find any rules for either set of transactions.
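The empty result is consistent with the item counts printed in the summaries. The sketch below copies the five most frequent items of each period by hand and checks them against the default minimum support of 0.1. Only the top five items are listed in the summaries, so this is a partial check rather than a full reproduction of the run.

```r
# Counts of the five most frequent items, copied from the summary() outputs.
morning   <- c(Coffee = 2113, Bread = 1490, Pastry = 572, Tea = 441, Medialuna = 380)
afternoon <- c(Coffee = 2340, Bread = 1556, Tea = 864, Cake = 696, Sandwich = 590)

min_support <- 0.1

# Items whose relative frequency reaches the default minimum support.
frequent_morning   <- names(morning)[morning / 4103 >= min_support]
frequent_afternoon <- names(afternoon)[afternoon / 5089 >= min_support]
frequent_morning
frequent_afternoon

# Highest single-item support in each period.
max(morning / 4103)
max(afternoon / 5089)
```

Four morning items and five afternoon items pass the threshold, matching the item counts in the "sorting and recoding items" lines of the logs. No single item comes close to a support of 0.8, and the most frequent pair reported in this paper, coffee with bread, has a support of about 0.09 in both periods, which is below the 0.1 threshold.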
To find any rules in the analyzed datasets, the minimum support and minimum confidence thresholds had to be lowered. They have been set to 0.01 and 0.5 respectively. The minimum rule length has also been set to 2, so that rules with an empty antecedent are skipped.
rules_morning<-apriori(df_morning, parameter=list(supp=0.01, conf=0.5, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 41
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[76 item(s), 4103 transaction(s)] done [0.00s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [12 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_afternoon<-apriori(df_afternoon, parameter=list(supp=0.01, conf=0.5, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 50
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[86 item(s), 5089 transaction(s)] done [0.00s].
## sorting and recoding items ... [32 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [9 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Lowering minimum support and confidence values resulted in finding 12 rules for morning transactions and 9 rules for afternoon ones.
plot_morning_rules <- plot(rules_morning, measure=c("support","lift"), shading="confidence", main="Morning transactions rules")
plot(rules_morning, method="graph")
plot_afternoon_rules <- plot(rules_afternoon, measure=c("support","lift"), shading="confidence", main="Afternoon transactions rules")
plot(rules_afternoon, method="graph")
The rules with the highest support, confidence and lift are displayed below. Not surprisingly, coffee appears in all of the displayed rules, as it was present in over half of the morning transactions and in almost half of the afternoon ones.
inspect(sort(rules_morning, by = "support")[1:3], linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {Pastry} => {Coffee} 0.07726054 0.5541958 0.13941019 1.076131 317
## [2] {Medialuna} => {Coffee} 0.05459420 0.5894737 0.09261516 1.144633 224
## [3] {Toast} => {Coffee} 0.03582744 0.7205882 0.04971972 1.399230 147
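Coffee is the most popular item in the bakery. Among the morning rules that pass the 0.5 confidence threshold, the one with the highest support pairs coffee with pastry (317 transactions, 7.7% of all morning transactions). The next rules by support show that coffee is also readily bought with medialuna (5.5% of all morning transactions) and toast (3.6%). Coffee and bread are bought together even more often (386 morning transactions, as the product-specific search further below shows), but neither rule between them reaches 0.5 confidence, so the pair does not appear here.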
inspect(sort(rules_morning, by = "confidence")[1:3], linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {Toast} => {Coffee} 0.03582744 0.7205882 0.04971972 1.399230 147
## [2] {Juice} => {Coffee} 0.01949793 0.6106870 0.03192786 1.185825 80
## [3] {Cookies} => {Coffee} 0.02851572 0.6000000 0.04752620 1.165073 117
The morning rules with the highest confidence show that a customer who buys toast, juice or cookies will also buy coffee with a probability of 72%, 61% and 60% respectively.
inspect(sort(rules_morning, by = "lift")[1:3], linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {Toast} => {Coffee} 0.03582744 0.7205882 0.04971972 1.399230 147
## [2] {Juice} => {Coffee} 0.01949793 0.6106870 0.03192786 1.185825 80
## [3] {Cookies} => {Coffee} 0.02851572 0.6000000 0.04752620 1.165073 117
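All three morning rules with the highest lift have a lift above 1. This means that coffee is more likely to appear in a transaction that contains toast, juice or cookies than in a morning transaction in general: about 1.4 times more likely for toast and about 1.2 times for juice and cookies. For rules that share the same consequent, lift is confidence divided by a constant, which is why the top three rules by lift are the same as the top three by confidence.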
inspect(sort(rules_afternoon, by = "support")[1:3], linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {Cake} => {Coffee} 0.07191983 0.5258621 0.13676557 1.143638 366
## [2] {Sandwich} => {Coffee} 0.06229122 0.5372881 0.11593633 1.168487 317
## [3] {Pastry} => {Coffee} 0.02554529 0.5579399 0.04578503 1.213400 130
The afternoon rules with the highest support differ a bit from the morning ones. Among the rules that pass the confidence threshold, coffee is bought most frequently with cake (7.2% of all afternoon transactions), sandwich (6.2%) and pastry (2.6%, versus 7.7% in the morning).
inspect(sort(rules_afternoon, by = "confidence")[1:3], linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {Toast} => {Coffee} 0.01513067 0.6754386 0.02240126 1.468935 77
## [2] {Salad} => {Coffee} 0.01120063 0.6404494 0.01748870 1.392841 57
## [3] {Pastry} => {Coffee} 0.02554529 0.5579399 0.04578503 1.213400 130
Just as in the morning, the afternoon rule with the highest confidence indicates that a customer who buys toast will probably also buy coffee (with a probability of 68%). The second and third rules concern buying coffee when the customer has already picked a salad (64%) or a pastry (56%).
inspect(sort(rules_afternoon, by = "lift")[1:3], linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {Toast} => {Coffee} 0.01513067 0.6754386 0.02240126 1.468935 77
## [2] {Salad} => {Coffee} 0.01120063 0.6404494 0.01748870 1.392841 57
## [3] {Pastry} => {Coffee} 0.02554529 0.5579399 0.04578503 1.213400 130
Just as in the morning, all three afternoon rules with the highest lift have a lift above 1. Here it means that coffee is more likely to appear in a transaction that contains toast, salad or pastry than in an afternoon transaction in general.
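To compare the two periods side by side, the rules printed above can be joined on their antecedent. The sketch below copies the displayed rules by hand into two data frames; only the top three rules per measure are shown in this paper, so the full rule sets (12 and 9 rules) may share more antecedents than this comparison finds.

```r
# Rules with consequent {Coffee} as displayed above (values copied by hand).
morning <- data.frame(
  lhs        = c("Pastry", "Medialuna", "Toast", "Juice", "Cookies"),
  confidence = c(0.5541958, 0.5894737, 0.7205882, 0.6106870, 0.6000000),
  lift       = c(1.076131, 1.144633, 1.399230, 1.185825, 1.165073)
)
afternoon <- data.frame(
  lhs        = c("Cake", "Sandwich", "Pastry", "Toast", "Salad"),
  confidence = c(0.5258621, 0.5372881, 0.5579399, 0.6754386, 0.6404494),
  lift       = c(1.143638, 1.168487, 1.213400, 1.468935, 1.392841)
)

# Antecedents that appear in the displayed rules of both periods.
both <- merge(morning, afternoon, by = "lhs",
              suffixes = c("_morning", "_afternoon"))
both

# Antecedents displayed for one period only.
only_morning   <- setdiff(morning$lhs, afternoon$lhs)
only_afternoon <- setdiff(afternoon$lhs, morning$lhs)
```

Among the displayed rules, pastry and toast lead to coffee in both periods, while medialuna, juice and cookies show up only in the morning and cake, sandwich and salad only in the afternoon.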
It is also possible to look for association rules involving a specific product. Some examples can be found below:
rules_sandwich <- apriori(data = df_afternoon, parameter = list(supp = 0.01, conf = 0.005, minlen = 2), appearance = list(default = "rhs", lhs = "Sandwich"), control = list(verbose = FALSE))
rules_sandwich_byconf<-sort(rules_sandwich, by="confidence", decreasing=TRUE)
inspect((rules_sandwich_byconf)[2], linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {Sandwich} => {Bread} 0.02652781 0.2288136 0.1159363 0.7483497 135
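About 23% of the afternoon transactions with a sandwich also contain bread, which makes bread the second most common companion of a sandwich, just after coffee. The lift of 0.75 is below 1, however: customers who buy a sandwich are less likely to buy bread than afternoon customers in general, so this is a negative association.
rules_tea <- apriori(data = df_morning, parameter = list(supp = 0.01, conf = 0.005, minlen = 2), appearance = list(default = "lhs", rhs = "Tea"), control = list(verbose = FALSE))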
rules_tea_byconf<-sort(rules_tea, by="confidence", decreasing=TRUE)
inspect((rules_tea_byconf)[1], linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {Cake} => {Tea} 0.01218621 0.1930502 0.06312454 1.796111 50
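The result shows that cake is the antecedent with the highest confidence for tea in the morning: 19% of the morning transactions with cake also contain tea. The lift of 1.8 means that customers with cake in their basket are about 1.8 times more likely to buy tea than morning customers in general.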
In the morning:
rules_coffee <- apriori(data = df_morning, parameter = list(supp = 0.01, conf = 0.005, minlen = 2), appearance = list(default = "rhs", lhs = "Coffee"), control = list(verbose = FALSE))
rules_coffee_byconf<-sort(rules_coffee, by="confidence", decreasing=TRUE)
inspect((rules_coffee_byconf)[1:2], linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {Coffee} => {Bread} 0.09407750 0.1826787 0.514989 0.5030406 386
## [2] {Coffee} => {Pastry} 0.07726054 0.1500237 0.514989 1.0761313 317
In the afternoon:
rules_coffee <- apriori(data = df_afternoon, parameter = list(supp = 0.01, conf = 0.005, minlen = 2), appearance = list(default = "rhs", lhs = "Coffee"), control = list(verbose = FALSE))
rules_coffee_byconf<-sort(rules_coffee, by="confidence", decreasing=TRUE)
inspect((rules_coffee_byconf)[1:2], linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {Coffee} => {Bread} 0.08999803 0.1957265 0.4598153 0.6401363 458
## [2] {Coffee} => {Cake} 0.07191983 0.1564103 0.4598153 1.1436376 366
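In the morning as well as in the afternoon, bread is the item most often bought together with coffee (in 18% and 20% of coffee transactions respectively). This reflects the overall popularity of bread rather than a preference of coffee buyers: the lift is well below 1 in both periods (0.50 and 0.64), so customers who buy coffee are less likely to buy bread than customers in general. The second most common companion is pastry in the morning and cake in the afternoon, both with a lift slightly above 1.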
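The lift values of the {Coffee} => {Bread} rules can be reproduced from numbers already printed in this paper: the confidence from the two tables above and the bread counts from the dataset summaries. A short sketch with the values copied by hand:

```r
# Values copied from the outputs shown in this paper.
n_morning   <- 4103
n_afternoon <- 5089

# Baseline share of transactions that contain bread.
bread_support_morning   <- 1490 / n_morning
bread_support_afternoon <- 1556 / n_afternoon

# Confidence of {Coffee} => {Bread} in each period.
conf_morning   <- 0.1826787
conf_afternoon <- 0.1957265

# Lift = confidence / support of the consequent.
lift_morning   <- conf_morning / bread_support_morning
lift_afternoon <- conf_afternoon / bread_support_afternoon
round(c(morning = lift_morning, afternoon = lift_afternoon), 3)
```

In both periods the share of coffee transactions that contain bread is lower than the share of all transactions that contain bread, which is what a lift below 1 expresses.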
Association rules allow us to find interesting patterns in customers’ preferences. In this study a dataset from a bakery located in Edinburgh has been analyzed. The most popular item in this store is coffee. It is so popular that it has dominated all the rules found. The analysis also compares transactions made in the morning and in the afternoon. There are some visible differences between these two parts of the day: for example, the rule with the highest support pairs coffee with pastry in the morning and with cake in the afternoon.