Association rules

Introduction
Dataset
Association rules
Conclusions

Introduction

The key to every successful business is an understanding of its customers. To increase sales and profit, companies have to discover what their clients like, how much money they are able to spend on the offered products and which product categories they are willing to buy together in a single transaction.


Association rules are very helpful for mining patterns in customer behavior. They are if-then statements that describe the relationships between data items. For example, if a customer buys a scarf, they will probably also look for gloves. In this example, scarf is the rule antecedent and gloves are the rule consequent.

The aim of this paper is to mine association rules in a dataset of bakery transactions and to check whether the rules found in morning transactions differ from those found in afternoon ones.

Dataset

Description:

The dataset used in this study consists of 9192 transactions from “The Bread Basket” bakery located in Edinburgh. It is available on Kaggle (https://www.kaggle.com/mittalvasu95/the-bread-basket). The dataset has been split into two separate frames: morning transactions (4103) and afternoon transactions (5089).
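The source does not show how the split was done. The following is a minimal sketch of one way to do it in base R, assuming a long-format table with one row per purchased item. The column names Transaction, Item and period_day and the toy rows are assumptions for illustration, not taken from the original analysis.

```r
# Toy long-format data: one row per purchased item.
# Column names and values are hypothetical, chosen only for this sketch.
sales <- data.frame(
  Transaction = c(1, 1, 2, 3, 3, 4),
  Item        = c("Coffee", "Bread", "Tea", "Coffee", "Cake", "Bread"),
  period_day  = c("morning", "morning", "morning",
                  "afternoon", "afternoon", "afternoon"),
  stringsAsFactors = FALSE
)

# Collect the items of each transaction into one basket (a character vector).
to_baskets <- function(d) split(d$Item, d$Transaction)

# One list of baskets per part of the day.
baskets_by_period <- lapply(split(sales, sales$period_day), to_baskets)

length(baskets_by_period$morning)    # number of morning transactions
length(baskets_by_period$afternoon)  # number of afternoon transactions
```

With the arules package loaded, a list of baskets like this can be coerced to the sparse transactions format seen in the summaries below with as(baskets, "transactions").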

Data analysis:

summary(df_morning)

## transactions as itemMatrix in sparse format with
##  4103 rows (elements/itemsets/transactions) and
##  76 columns (items) and a density of 0.02468348 
## 
## most frequent items:
##    Coffee     Bread    Pastry       Tea Medialuna   (Other) 
##      2113      1490       572       441       380      2701 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    9 
## 1836 1345  629  214   55   17    6    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.876   2.000   9.000 
## 
## includes extended item information - examples:
##                     labels
## 1 Afternoon with the baker
## 2                Alfajores
## 3          Argentina Night

As expected, the items most frequently bought in the morning are coffee, bread and pastry. Over 75% of transactions consist of only 1 or 2 items.

itemFrequencyPlot(df_morning, topN=15, type="relative", main="Morning transactions - Item frequency", col="#56B4E9")

Coffee appears in more than half of the transactions made in the morning.

summary(df_afternoon)

## transactions as itemMatrix in sparse format with
##  5089 rows (elements/itemsets/transactions) and
##  86 columns (items) and a density of 0.02441883 
## 
## most frequent items:
##   Coffee    Bread      Tea     Cake Sandwich  (Other) 
##     2340     1556      864      696      590     4641 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9 
## 1965 1644  816  429  171   45   11    4    4 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     2.0     2.1     3.0     9.0 
## 
## includes extended item information - examples:
##                     labels
## 1 Afternoon with the baker
## 2                Alfajores
## 3          Argentina Night

In the afternoon, coffee and bread are still the most popular items. Tea takes third place, and pastry is not even in the top 5.

itemFrequencyPlot(df_afternoon, topN=15, type="relative", main="Afternoon transactions - Item frequency", col="#009E73")

Coffee appears in more than 45% of transactions made in the afternoon.

There are 3 main measures used when it comes to mining for association rules:

Support

\(\mathrm{support}(X) = \frac{\mathrm{count}(X)}{N}\)

It shows how frequently an itemset or rule occurs in the dataset: count(X) is the number of transactions that contain X and N is the total number of transactions.

Confidence

\(\mathrm{confidence}(X \to Y) = \frac{\mathrm{support}(X,Y)}{\mathrm{support}(X)}\)

It shows the share of transactions containing one item or itemset (X) that also contain another item or itemset (Y). Here support(X, Y) is the support of the itemset that contains both X and Y.

Lift

\(\mathrm{lift}(X \to Y) = \frac{\mathrm{confidence}(X \to Y)}{\mathrm{support}(Y)}\)

It shows the rise in the probability of having item Y in the cart when item X is known to be present, relative to the probability of having item Y in the cart without any knowledge about the presence of X. If lift is greater than 1, there is a positive association between the two items or itemsets. If it is equal or close to 1, the items or itemsets are approximately independent. A value lower than 1 means that there is a negative association.
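The three measures can be computed by hand on a small example. The sketch below uses base R only; the five toy baskets and the helper names rule_support, rule_confidence and rule_lift are made up for illustration and are not part of the arules package.

```r
# Five toy transactions (hypothetical data, not from the bakery dataset).
baskets <- list(
  c("Coffee", "Toast"),
  c("Coffee", "Bread"),
  c("Coffee", "Toast", "Bread"),
  c("Bread"),
  c("Tea")
)

# support(X): share of baskets that contain every item of X.
rule_support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# confidence(X -> Y) = support(X, Y) / support(X)
rule_confidence <- function(lhs, rhs, baskets) {
  rule_support(c(lhs, rhs), baskets) / rule_support(lhs, baskets)
}

# lift(X -> Y) = confidence(X -> Y) / support(Y)
rule_lift <- function(lhs, rhs, baskets) {
  rule_confidence(lhs, rhs, baskets) / rule_support(rhs, baskets)
}

rule_support("Toast", baskets)               # Toast is in 2 of 5 baskets
rule_confidence("Toast", "Coffee", baskets)  # both Toast baskets contain Coffee
rule_lift("Toast", "Coffee", baskets)        # confidence divided by support(Coffee) = 3/5
```

In this toy data every toast basket also contains coffee, so the confidence is 1 and the lift is above 1, which indicates a positive association.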

Apriori

The Apriori algorithm identifies the frequent individual items in the database and extends them to larger and larger itemsets as long as those itemsets appear sufficiently often in the database.
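The level-wise idea can be sketched in a few lines of base R. This is a simplified illustration on made-up baskets, not the implementation used by arules: it joins frequent itemsets of size k into candidates of size k + 1 and counts them, but it omits the subset-pruning step of the full algorithm. The names count_of, min_count and the toy data are assumptions.

```r
# Toy transactions (hypothetical data).
baskets <- list(
  c("Bread", "Coffee"),
  c("Coffee", "Pastry"),
  c("Bread", "Coffee", "Pastry"),
  c("Bread"),
  c("Coffee", "Tea")
)
min_count <- 2  # an itemset is "frequent" if at least 2 baskets contain it

# Number of baskets that contain every item of the given itemset.
count_of <- function(items) {
  sum(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Level 1: frequent single items.
items <- sort(unique(unlist(baskets)))
level <- Filter(function(s) count_of(s) >= min_count, as.list(items))
frequent <- level

# Level k + 1: join pairs of frequent k-itemsets and keep the frequent results.
while (length(level) > 1) {
  size <- length(level[[1]]) + 1
  candidates <- list()
  for (i in seq_len(length(level) - 1)) {
    for (j in (i + 1):length(level)) {
      merged <- sort(union(level[[i]], level[[j]]))
      if (length(merged) == size) candidates <- c(candidates, list(merged))
    }
  }
  candidates <- unique(candidates)
  level <- Filter(function(s) count_of(s) >= min_count, candidates)
  frequent <- c(frequent, level)
}

# One label per frequent itemset.
labels <- vapply(frequent, paste, character(1), collapse = ", ")
print(labels)
```

On this toy data the search stops after size 3: Tea is dropped at the first level, {Bread, Pastry} at the second, and the only candidate of size 3 occurs in a single basket.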

First, the Apriori algorithm was run with its default values (minimum support = 0.1, minimum confidence = 0.8).

rules_morning<-apriori(df_morning)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 410 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[76 item(s), 4103 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_afternoon<-apriori(df_afternoon)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 508 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[86 item(s), 5089 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

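Unfortunately, the algorithm found no rules for either set of transactions. Only 4 items in the morning and 5 items in the afternoon reach the 10% support threshold, and no rule built from them satisfies both thresholds.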

To find any rules in the analyzed datasets, the thresholds of minimum support and minimum confidence had to be lowered. Their values were set to 0.01 and 0.5 respectively, and rules were required to contain at least 2 items (minlen=2).

rules_morning<-apriori(df_morning, parameter=list(supp=0.01, conf=0.5, minlen=2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 41 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[76 item(s), 4103 transaction(s)] done [0.00s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [12 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_afternoon<-apriori(df_afternoon, parameter=list(supp=0.01, conf=0.5, minlen=2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 50 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[86 item(s), 5089 transaction(s)] done [0.00s].
## sorting and recoding items ... [32 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [9 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Lowering minimum support and confidence values resulted in finding 12 rules for morning transactions and 9 rules for afternoon ones.
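The absolute minimum support counts printed in the four outputs above are consistent with taking the integer part of the relative threshold multiplied by the number of transactions. This is inferred from the printed values, not taken from the package documentation:

\[\lfloor 0.1 \times 4103 \rfloor = 410, \quad \lfloor 0.1 \times 5089 \rfloor = 508, \quad \lfloor 0.01 \times 4103 \rfloor = 41, \quad \lfloor 0.01 \times 5089 \rfloor = 50\]

At the lowered threshold an itemset therefore needs to occur in roughly 41 morning or 50 afternoon transactions to be considered frequent, compared with more than 400 and 500 under the defaults.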

plot_morning_rules <- plot(rules_morning, measure=c("support","lift"), shading="confidence", main="Morning transactions rules")

plot(rules_morning, method="graph")

plot_afternoon_rules <- plot(rules_afternoon, measure=c("support","lift"), shading="confidence", main="Afternoon transactions rules")

plot(rules_afternoon, method="graph")

The rules with the highest values of support, confidence and lift are displayed below. Not surprisingly, coffee appears in all of the displayed rules, as it was present in over half of morning transactions and in almost half of afternoon transactions.

inspect(sort(rules_morning, by = "support")[1:3], linebreak = FALSE)

##     lhs            rhs      support    confidence coverage   lift     count
## [1] {Pastry}    => {Coffee} 0.07726054 0.5541958  0.13941019 1.076131 317  
## [2] {Medialuna} => {Coffee} 0.05459420 0.5894737  0.09261516 1.144633 224  
## [3] {Toast}     => {Coffee} 0.03582744 0.7205882  0.04971972 1.399230 147

Coffee is the most popular item in the bakery. Among the morning rules found, the one with the highest support pairs coffee with pastry (317 transactions). The other rules with the highest support indicate that coffee is readily bought with medialuna (5.5% of all morning transactions) and toast (3.6% of all morning transactions).

inspect(sort(rules_morning, by = "confidence")[1:3], linebreak = FALSE)

##     lhs          rhs      support    confidence coverage   lift     count
## [1] {Toast}   => {Coffee} 0.03582744 0.7205882  0.04971972 1.399230 147  
## [2] {Juice}   => {Coffee} 0.01949793 0.6106870  0.03192786 1.185825  80  
## [3] {Cookies} => {Coffee} 0.02851572 0.6000000  0.04752620 1.165073 117

The rules for morning transactions with the highest confidence show that if a customer buys toast, juice or cookies, they will also buy coffee with a probability of 72%, 61% and 60% respectively.

inspect(sort(rules_morning, by = "lift")[1:3], linebreak = FALSE)

##     lhs          rhs      support    confidence coverage   lift     count
## [1] {Toast}   => {Coffee} 0.03582744 0.7205882  0.04971972 1.399230 147  
## [2] {Juice}   => {Coffee} 0.01949793 0.6106870  0.03192786 1.185825  80  
## [3] {Cookies} => {Coffee} 0.02851572 0.6000000  0.04752620 1.165073 117

All 3 rules for morning transactions with the highest lift have a lift value exceeding 1. This means that coffee is more likely to appear in a transaction that contains toast, juice or cookies than in a morning transaction in general.
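As a check on the reported figures, the confidence and lift of {Toast} => {Coffee} can be recomputed from the support and coverage columns (coverage is the support of the antecedent) and from the morning coffee count of 2113 out of 4103 transactions:

\[\mathrm{confidence} = \frac{0.03582744}{0.04971972} \approx 0.7206, \qquad \mathrm{lift} = \frac{0.7206}{2113/4103} \approx \frac{0.7206}{0.5150} \approx 1.399\]

For rules that share the consequent coffee, lift is confidence divided by the same constant, which would explain why the top 3 rules by lift repeat the top 3 rules by confidence.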

inspect(sort(rules_afternoon, by = "support")[1:3], linebreak = FALSE)

##     lhs           rhs      support    confidence coverage   lift     count
## [1] {Cake}     => {Coffee} 0.07191983 0.5258621  0.13676557 1.143638 366  
## [2] {Sandwich} => {Coffee} 0.06229122 0.5372881  0.11593633 1.168487 317  
## [3] {Pastry}   => {Coffee} 0.02554529 0.5579399  0.04578503 1.213400 130

The rules for afternoon transactions with the highest support differ a bit from the morning ones. Among the rules found, coffee is most frequently paired with cake (7.2% of all afternoon transactions), sandwich (6.2%) and pastry (2.6%, vs 7.7% in the morning).

inspect(sort(rules_afternoon, by = "confidence")[1:3], linebreak = FALSE)

##     lhs         rhs      support    confidence coverage   lift     count
## [1] {Toast}  => {Coffee} 0.01513067 0.6754386  0.02240126 1.468935  77  
## [2] {Salad}  => {Coffee} 0.01120063 0.6404494  0.01748870 1.392841  57  
## [3] {Pastry} => {Coffee} 0.02554529 0.5579399  0.04578503 1.213400 130

Just as in the case of morning transactions, the afternoon rule with the highest confidence indicates that if a customer buys toast, they will probably also buy coffee (with a probability of 68%). The second and third rules concern buying coffee when the customer has already picked salad (64% probability) or pastry (56%).

inspect(sort(rules_afternoon, by = "lift")[1:3], linebreak = FALSE)

##     lhs         rhs      support    confidence coverage   lift     count
## [1] {Toast}  => {Coffee} 0.01513067 0.6754386  0.02240126 1.468935  77  
## [2] {Salad}  => {Coffee} 0.01120063 0.6404494  0.01748870 1.392841  57  
## [3] {Pastry} => {Coffee} 0.02554529 0.5579399  0.04578503 1.213400 130

Just as in the case of morning transactions, all 3 rules for afternoon transactions with the highest lift have a lift value exceeding 1. In this case it means that coffee is more likely to appear in a transaction that contains toast, salad or pastry than in an afternoon transaction in general.

Examples of mining for other rules

It is possible to look for association rules based on a specific product. Some examples can be found below:

What else do people buy in the afternoon if they have already picked a sandwich?

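# Sandwich is the only item allowed as antecedent; all other items may appear only as consequent.
rules_sandwich <- apriori(data=df_afternoon, parameter=list(supp=0.01, conf=0.005, minlen=2), appearance=list(default="rhs", lhs="Sandwich"), control=list(verbose=FALSE))
rules_sandwich_byconf <- sort(rules_sandwich, by="confidence", decreasing=TRUE)
# The first rule is {Sandwich} => {Coffee}, so the second one is shown.
inspect(rules_sandwich_byconf[2], linebreak = FALSE)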

##     lhs           rhs     support    confidence coverage  lift      count
## [1] {Sandwich} => {Bread} 0.02652781 0.2288136  0.1159363 0.7483497 135

About 23% of afternoon transactions with a sandwich also include bread, which makes bread the second most popular choice of sandwich lovers, just after coffee. The lift of 0.75 shows, however, that sandwich buyers are less likely to buy bread than afternoon customers in general.
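The lift of this rule can be recomputed from the reported confidence and from the afternoon bread count of 1556 out of 5089 transactions:

\[\mathrm{lift}(\mathrm{Sandwich} \to \mathrm{Bread}) = \frac{0.2288}{1556/5089} \approx \frac{0.2288}{0.3058} \approx 0.748\]

Bread is in about 31% of all afternoon transactions but only in about 23% of those with a sandwich, hence the value below 1.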

Why do people buy tea in the morning?

# Tea is the only item allowed as consequent; all other items may appear only as antecedent.
rules_tea <- apriori(data=df_morning, parameter=list(supp=0.01, conf=0.005, minlen=2), appearance=list(default="lhs", rhs="Tea"), control=list(verbose=FALSE))
rules_tea_byconf <- sort(rules_tea, by="confidence", decreasing=TRUE)
inspect(rules_tea_byconf[1], linebreak = FALSE)

##     lhs       rhs   support    confidence coverage   lift     count
## [1] {Cake} => {Tea} 0.01218621 0.1930502  0.06312454 1.796111 50

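The result shows that cake is the antecedent with the highest confidence for tea: about 19% of morning transactions with cake also include tea, which is about 1.8 times the share of tea in morning transactions overall (lift 1.80).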

What are people likely to buy with coffee in the morning and in the afternoon?

In the morning:

# Coffee is the only item allowed as antecedent; all other items may appear only as consequent.
rules_coffee <- apriori(data=df_morning, parameter=list(supp=0.01, conf=0.005, minlen=2), appearance=list(default="rhs", lhs="Coffee"), control=list(verbose=FALSE))
rules_coffee_byconf <- sort(rules_coffee, by="confidence", decreasing=TRUE)
inspect(rules_coffee_byconf[1:2], linebreak = FALSE)

##     lhs         rhs      support    confidence coverage lift      count
## [1] {Coffee} => {Bread}  0.09407750 0.1826787  0.514989 0.5030406 386  
## [2] {Coffee} => {Pastry} 0.07726054 0.1500237  0.514989 1.0761313 317

In the afternoon:

# Same query for the afternoon transactions (overwrites the morning result).
rules_coffee <- apriori(data=df_afternoon, parameter=list(supp=0.01, conf=0.005, minlen=2), appearance=list(default="rhs", lhs="Coffee"), control=list(verbose=FALSE))
rules_coffee_byconf <- sort(rules_coffee, by="confidence", decreasing=TRUE)
inspect(rules_coffee_byconf[1:2], linebreak = FALSE)

##     lhs         rhs     support    confidence coverage  lift      count
## [1] {Coffee} => {Bread} 0.08999803 0.1957265  0.4598153 0.6401363 458  
## [2] {Coffee} => {Cake}  0.07191983 0.1564103  0.4598153 1.1436376 366

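In the morning as well as in the afternoon, bread is the item most often bought together with coffee (18% and 20% of coffee transactions). Its lift is well below 1 in both cases (0.50 and 0.64), so coffee buyers are in fact less likely to buy bread than customers in general. The second most popular choice is pastry in the morning and cake in the afternoon, and both have a lift above 1.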
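Comparing these rules with the earlier {Pastry} => {Coffee} rule illustrates that lift is symmetric while confidence is not. Substituting the confidence formula into the lift formula gives

\[\mathrm{lift}(X \to Y) = \frac{\mathrm{support}(X,Y)}{\mathrm{support}(X)\,\mathrm{support}(Y)} = \mathrm{lift}(Y \to X)\]

For pastry and coffee in the morning, using the reported values:

\[\mathrm{lift} = \frac{0.07726}{0.13941 \times 0.51499} \approx 1.076, \qquad \mathrm{confidence}(\mathrm{Pastry} \to \mathrm{Coffee}) = \frac{0.07726}{0.13941} \approx 0.554, \qquad \mathrm{confidence}(\mathrm{Coffee} \to \mathrm{Pastry}) = \frac{0.07726}{0.51499} \approx 0.150\]

Both directions of the rule therefore report the same lift of 1.076, while the confidence depends on which item is the antecedent.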

Conclusions

Association rules allow us to find interesting patterns in customer preferences. In this study, a dataset from a bakery located in Edinburgh has been analyzed. The most popular item in this store is coffee. It is so popular that it has dominated all the rules found. The analysis also compares transactions made in the morning and in the afternoon. There are some visible differences between those two parts of the day: the highest-support rules pair coffee with pastry and medialuna in the morning, and with cake and sandwich in the afternoon.

Association rules - analysis of transactions from bakery

Wojciech Konarzewski

16.01.2021
