1 Introduction

Hi! In this kernel we are going to use the Apriori algorithm to perform a Market Basket Analysis. A market what? It’s a technique used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions, providing information to understand purchase behavior. The outcome of this type of technique is, in simple terms, a set of rules that can be understood as “if this, then that”.

First, it’s important to define the Apriori algorithm, including some statistical concepts (support, confidence, lift and conviction) used to select interesting rules. Then we are going to apply the algorithm to a data set containing more than 6,000 transactions from a bakery and find combinations of products that are bought together. Let’s start!

2 Association rules

The Apriori algorithm generates association rules for a given data set. An association rule implies that if an item A occurs, then item B also occurs with a certain probability. Let’s see an example,

Transaction   Items
t1            {T-shirt, Trousers, Belt}
t2            {T-shirt, Jacket}
t3            {Jacket, Gloves}
t4            {T-shirt, Trousers, Jacket}
t5            {T-shirt, Trousers, Sneakers, Jacket, Belt}
t6            {Trousers, Sneakers, Belt}
t7            {Trousers, Belt, Sneakers}

In the table above we can see seven transactions from a clothing store. Each transaction shows items bought in that transaction. We can represent our items as an item set as follows:

\[I=\{i_1, i_2, \ldots, i_k\}\]

In our case it corresponds to:

\[I=\{\text{T-shirt}, \text{Trousers}, \text{Belt}, \text{Jacket}, \text{Gloves}, \text{Sneakers}\}\]

The set of transactions in the data set is represented by the following expression:

\[T=\{t_1, t_2, \ldots, t_n\}\]

For example,

\[t_1=\{\text{T-shirt}, \text{Trousers}, \text{Belt}\}\]

Then, an association rule is defined as an implication of the form:

\(X \Rightarrow Y\), where \(X \subset I\), \(Y \subset I\) and \(X \cap Y = \emptyset\)

For example,

\[\{\text{T-shirt}, \text{Trousers}\} \Rightarrow \{\text{Belt}\}\]

In the following sections we are going to define four measures that quantify how interesting a rule is.

2.1 Support

Support is an indication of how frequently the item set appears in the data set.

\[supp(X \Rightarrow Y)=\dfrac{|\{t \in T : X \cup Y \subseteq t\}|}{n}\]

In other words, it’s the number of transactions containing both \(X\) and \(Y\) divided by the total number of transactions \(n\). Rules with low support are usually not interesting, since they describe combinations that rarely occur. Let’s see different examples using the clothing store transactions from the previous table.

  • \(supp(\text{T-shirt} \Rightarrow \text{Trousers})=\dfrac{3}{7} \approx 43\%\)

  • \(supp(\text{Trousers} \Rightarrow \text{Belt})=\dfrac{4}{7} \approx 57\%\)

  • \(supp(\text{T-shirt} \Rightarrow \text{Belt})=\dfrac{2}{7} \approx 29\%\)

  • \(supp(\{\text{T-shirt}, \text{Trousers}\} \Rightarrow \{\text{Belt}\})=\dfrac{2}{7} \approx 29\%\)

2.2 Confidence

For a rule \(X \Rightarrow Y\), confidence is the percentage of transactions containing \(X\) in which \(Y\) is also bought. It’s an indication of how often the rule has been found to be true.

\[conf(X \Rightarrow Y)=\dfrac{supp(X \cup Y)}{supp(X)}\]

For example, the rule \(\text{T-shirt} \Rightarrow \text{Trousers}\) has a confidence of \(\dfrac{3/7}{4/7}=3/4\), which means that the rule is correct for 75% of the transactions containing a t-shirt (75% of the times a customer buys a t-shirt, trousers are bought as well). Three more examples:

  • \(conf(\text{Trousers} \Rightarrow \text{Belt})=\dfrac{4/7}{5/7}=80\%\)

  • \(conf(\text{T-shirt} \Rightarrow \text{Belt})=\dfrac{2/7}{4/7}=50\%\)

  • \(conf(\{\text{T-shirt}, \text{Trousers}\} \Rightarrow \{\text{Belt}\})=\dfrac{2/7}{3/7} \approx 67\%\)

2.3 Lift

The lift of a rule is the ratio of the observed support to that expected if \(X\) and \(Y\) were independent, and is defined as

\[lift(X \Rightarrow Y)=\dfrac{supp(X \cup Y)}{supp(X)\,supp(Y)}\]

If the lift equals 1, \(X\) and \(Y\) are independent; values greater than 1 indicate a positive association between the items (they occur together more often than expected), while values below 1 indicate a negative association. Let’s see some examples:

  • \(lift(\text{T-shirt} \Rightarrow \text{Trousers})=\dfrac{3/7}{(4/7)(5/7)}=1.05\)

  • \(lift(\text{Trousers} \Rightarrow \text{Belt})=\dfrac{4/7}{(5/7)(4/7)}=1.4\)

  • \(lift(\text{T-shirt} \Rightarrow \text{Belt})=\dfrac{2/7}{(4/7)(4/7)}=0.875\)

  • \(lift(\{\text{T-shirt}, \text{Trousers}\} \Rightarrow \{\text{Belt}\})=\dfrac{2/7}{(3/7)(4/7)} \approx 1.17\)

2.4 Conviction

The conviction of a rule is defined as

\[conv(X \Rightarrow Y)=\dfrac{1-supp(Y)}{1-conf(X \Rightarrow Y)}\]

It can be interpreted as the ratio of the expected frequency that \(X\) occurs without \(Y\) (if \(X\) and \(Y\) were independent) to the observed frequency of incorrect predictions. A conviction of 1 means that \(X\) and \(Y\) are independent; higher values mean that the consequent depends more strongly on the antecedent. Let’s see some examples:

  • \(conv(\text{T-shirt} \Rightarrow \text{Trousers})=\dfrac{1-5/7}{1-3/4} \approx 1.14\)

  • \(conv(\text{Trousers} \Rightarrow \text{Belt})=\dfrac{1-4/7}{1-4/5} \approx 2.14\)

  • \(conv(\text{T-shirt} \Rightarrow \text{Belt})=\dfrac{1-4/7}{1-1/2} \approx 0.86\)

  • \(conv(\{\text{T-shirt}, \text{Trousers}\} \Rightarrow \{\text{Belt}\})=\dfrac{1-4/7}{1-2/3} \approx 1.29\)
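
These measures can also be computed programmatically. Here is a minimal sketch using the arules R package (used later in this kernel) to rebuild the toy clothing-store transactions and verify the calculations above; the thresholds are chosen just low enough to keep all the example rules.

library(arules)

# The seven transactions from the clothing store table
toy <- list(
  c("T-shirt", "Trousers", "Belt"),
  c("T-shirt", "Jacket"),
  c("Jacket", "Gloves"),
  c("T-shirt", "Trousers", "Jacket"),
  c("T-shirt", "Trousers", "Sneakers", "Jacket", "Belt"),
  c("Trousers", "Sneakers", "Belt"),
  c("Trousers", "Belt", "Sneakers")
)
toyTrans <- as(toy, "transactions")

# Mine rules with thresholds low enough to include the examples above
toyRules <- apriori(toyTrans,
                    parameter = list(supp = 0.25, conf = 0.5, minlen = 2),
                    control = list(verbose = FALSE))

# Support, confidence and lift are reported by default;
# conviction can be added with interestMeasure()
quality(toyRules)$conviction <- interestMeasure(toyRules, "conviction",
                                                transactions = toyTrans)
inspect(toyRules)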


3 Loading Data

First we need to load some libraries and import our data. We can use the read.transactions() function from the arules package to create a transactions object.
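
A minimal sketch of this step (the file name BreadBasket_DMS.csv and the column names are assumptions; adjust them to your copy of the data):

library(tidyverse)
library(arules)

# The raw file has one purchased item per row, so we read it in
# "single" format, indicating the transaction-ID and item columns
trans <- read.transactions("BreadBasket_DMS.csv",
                           format = "single",
                           sep = ",",
                           header = TRUE,
                           cols = c("Transaction", "Item"))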

Let’s get an idea of what we’re working with.

3.1 Transaction object

## transactions in sparse format with
##  6614 transactions (rows) and
##  104 items (columns)

3.2 Summary

## transactions as itemMatrix in sparse format with
##  6614 rows (elements/itemsets/transactions) and
##  104 columns (items) and a density of 0.02008705 
## 
## most frequent items:
##  Coffee   Bread     Tea    Cake  Pastry (Other) 
##    3188    2146     941     694     576    6272 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10 
## 2556 2154 1078  546  187   67   18    3    2    3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.089   3.000  10.000 
## 
## includes extended item information - examples:
##                     labels
## 1               Adjustment
## 2 Afternoon with the baker
## 3                Alfajores
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2            10
## 3          1000

3.3 Glimpse

## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   ..@ itemInfo   :'data.frame':  104 obs. of  1 variable:
##   .. ..$ labels: chr [1:104] "Adjustment" "Afternoon with the baker" "Alfajores" "Argentina Night" ...
##   ..@ itemsetInfo:'data.frame':  6614 obs. of  1 variable:
##   .. ..$ transactionID: chr [1:6614] "1" "10" "1000" "1001" ...

3.4 Structure

## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:13817] 11 63 80 19 80 11 93 14 45 11 ...
##   .. .. ..@ p       : int [1:6615] 0 1 3 5 7 9 11 15 16 17 ...
##   .. .. ..@ Dim     : int [1:2] 104 6614
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  104 obs. of  1 variable:
##   .. ..$ labels: chr [1:104] "Adjustment" "Afternoon with the baker" "Alfajores" "Argentina Night" ...
##   ..@ itemsetInfo:'data.frame':  6614 obs. of  1 variable:
##   .. ..$ transactionID: chr [1:6614] "1" "10" "1000" "1001" ...

4 Data Dictionary

The raw data set contains 15,010 observations, one row per item purchased, with columns for the date and time of the purchase, the transaction ID and the item.

5 Data Analysis

Before applying the Apriori algorithm on the data set, we are going to show some visualizations to learn more about the transactions. For example, we can use the itemFrequencyPlot() function to create an item frequency bar plot, in order to view the distribution of products.

The itemFrequencyPlot() function allows us to show absolute or relative values. With type = "absolute" it plots the raw count of each item independently; with type = "relative" it plots the fraction of transactions in which each item appears, as shown in the following plot.
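
A sketch of the call behind this plot (topN = 15 and the styling are assumptions; extra arguments are passed on to barplot()):

# Top items by support (fraction of transactions containing the item)
itemFrequencyPlot(trans, topN = 15, type = "relative",
                  col = "lightcyan2", xlab = "Item",
                  main = "Relative item frequency plot")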

Coffee is the best-selling product by far, followed by bread and tea. Let’s display some other visualizations describing the time distribution using the ggplot() function.
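
As an illustration, here is a minimal sketch of one of these plots, assuming the raw CSV has also been read into a data frame (the Date and Transaction column names are assumptions):

library(tidyverse)

bakery <- read_csv("BreadBasket_DMS.csv")  # hypothetical raw file

# Number of distinct transactions per month
bakery %>%
  mutate(Month = format(as.Date(Date), "%Y-%m")) %>%
  distinct(Month, Transaction) %>%
  count(Month) %>%
  ggplot(aes(x = Month, y = n)) +
  geom_col(fill = "steelblue") +
  labs(x = "Month", y = "Transactions", title = "Transactions per month") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))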

The data set includes dates from 30/10/2016 to 09/04/2017, that’s why we have so few transactions in October and April.

As we can see, Saturday is the busiest day in the bakery. Conversely, Wednesday is the day with the fewest transactions.

There’s not much to discuss with this visualization. The results are logical and expected.

6 Apriori algorithm

6.1 Choice of support and confidence

The first step in creating a set of association rules is to determine the optimal thresholds for support and confidence. If we set these values too low, the algorithm will take longer to execute and we will get a lot of rules (most of them not useful). So, what values do we choose? We can try different values of support and confidence and see graphically how many rules are generated for each combination (a sketch of this grid search is included at the end of this subsection).

In the following graphs we can see the number of rules generated with a support level of 10%, 5%, 1% and 0.5%.

We can join the four lines to improve the visualization.

Let’s analyze the results,

  • Support level of 10%. We only identify a few rules with very low confidence levels. This means that there are no relatively frequent associations in our data set. We can’t choose this value; the resulting rules would be unrepresentative.

  • Support level of 5%. We only identify one rule with a confidence of at least 50%. It seems that we have to look for support levels below 5% to obtain a greater number of rules with reasonable confidence.

  • Support level of 1%. We start to get dozens of rules, 13 of which have a confidence of at least 50%.

  • Support level of 0.5%. Too many rules to analyze!

To sum up, we are going to use a support level of 1% and a confidence level of 50%.
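
A minimal sketch of the grid search behind these graphs (the exact confidence grid is an assumption):

# Count how many rules apriori() generates for each
# support/confidence combination
supportLevels <- c(0.1, 0.05, 0.01, 0.005)
confidenceLevels <- seq(0.9, 0.1, by = -0.1)

ruleCounts <- sapply(supportLevels, function(s) {
  sapply(confidenceLevels, function(cf) {
    length(apriori(trans,
                   parameter = list(supp = s, conf = cf, target = "rules"),
                   control = list(verbose = FALSE)))
  })
})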

6.2 Execution

Let’s execute the Apriori algorithm with the values obtained in the previous section.
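
A sketch of this call (minlen = 2 excludes rules with an empty left-hand side; note that {NONE} is an actual item label in this data set):

# Apriori with a 1% support and 50% confidence threshold
rules <- apriori(trans,
                 parameter = list(supp = 0.01, conf = 0.5, minlen = 2))

# Show the rules, sorted by increasing support
inspect(sort(rules, by = "support", decreasing = FALSE))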

The generated association rules are the following,

##      lhs                 rhs      support    confidence coverage   lift    
## [1]  {Tiffin}         => {Coffee} 0.01058361 0.5468750  0.01935289 1.134577
## [2]  {Spanish Brunch} => {Coffee} 0.01406108 0.6326531  0.02222558 1.312537
## [3]  {Scone}          => {Coffee} 0.01844572 0.5422222  0.03401875 1.124924
## [4]  {Toast}          => {Coffee} 0.02570305 0.7296137  0.03522830 1.513697
## [5]  {Alfajores}      => {Coffee} 0.02237678 0.5522388  0.04052011 1.145705
## [6]  {Juice}          => {Coffee} 0.02131842 0.5300752  0.04021772 1.099723
## [7]  {Hot chocolate}  => {Coffee} 0.02721500 0.5263158  0.05170850 1.091924
## [8]  {Medialuna}      => {Coffee} 0.03296039 0.5751979  0.05730269 1.193337
## [9]  {Cookies}        => {Coffee} 0.02978530 0.5267380  0.05654672 1.092800
## [10] {NONE}           => {Coffee} 0.04172966 0.5810526  0.07181736 1.205484
## [11] {Sandwich}       => {Coffee} 0.04233444 0.5679513  0.07453886 1.178303
## [12] {Pastry}         => {Coffee} 0.04868461 0.5590278  0.08708800 1.159790
## [13] {Cake}           => {Coffee} 0.05654672 0.5389049  0.10492894 1.118042
##      count
## [1]   70  
## [2]   93  
## [3]  122  
## [4]  170  
## [5]  148  
## [6]  141  
## [7]  180  
## [8]  218  
## [9]  197  
## [10] 276  
## [11] 280  
## [12] 322  
## [13] 374

We can also create an HTML table widget using the inspectDT() function from the arulesViz package. Rules can be interactively filtered and sorted.
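
For example:

# Interactive HTML table widget (built on the DT package)
inspectDT(rules)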

How do we interpret these rules?

  • Around 53% of the customers who bought a hot chocolate also bought a coffee.

  • 63% of the customers who bought a Spanish brunch also bought a coffee.

  • 73% of the customers who bought toast also bought a coffee.

And so on. It seems that in this bakery there are many coffee lovers!

6.3 Visualize association rules

We are going to use the arulesViz package to create the visualizations. Let’s begin with a simple scatter plot with different measures of interestingness on the axes (lift and support) and a third measure (confidence) represented by the color of the points.
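
A sketch of this plot (plot() here is arulesViz’s method for rules objects):

library(arulesViz)

# Support vs lift, with confidence mapped to the point color
plot(rules, measure = c("support", "lift"), shading = "confidence")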

The following visualization represents the rules as a graph with items as labeled vertices, and rules represented as vertices connected to items using arrows.

We can also change the graph layout.
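
Sketches of both graphs (the circular layout passed through the control argument is an assumption; it requires the igraph package):

# Items as labeled vertices; arrows run from antecedent to consequent
plot(rules, method = "graph")

# The same rules with a circular layout
plot(rules, method = "graph", control = list(layout = igraph::in_circle()))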

What else can we do? We can represent the rules as a grouped matrix-based visualization. The support and lift measures are represented by the size and color of the balloons, respectively. In this case it’s not a very useful visualization, since we only have coffee on the right-hand side of the rules.
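
A sketch of the call:

# Grouped matrix: antecedent groups vs consequents, with balloon size
# showing support and color showing lift
plot(rules, method = "grouped")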

There’s an awesome function called ruleExplorer() that explores association rules using interactive manipulations and visualizations built with Shiny. Unfortunately, R Markdown still doesn’t support Shiny app objects.
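
If you run the kernel locally, you can still try it:

# Opens an interactive Shiny app to filter and visualize the rules
ruleExplorer(rules)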

6.4 Another execution

We have executed the Apriori algorithm with the appropriate support and confidence values. What happens if we execute it with low values? How do the visualizations change? Let’s try with a support level of 0.5% and a confidence level of 10%.
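
A sketch of this second execution:

# Deliberately low thresholds: this generates a flood of rules
rules2 <- apriori(trans,
                  parameter = list(supp = 0.005, conf = 0.1, minlen = 2))

plot(rules2, measure = c("support", "lift"), shading = "confidence")
plot(rules2, method = "graph")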

It’s impossible to analyze these visualizations! For larger rule sets visual analysis becomes difficult. Furthermore, most of the rules are useless. That’s why we have to carefully select the right values of support and confidence.

7 Exercises

In this section you can test the concepts learned during this kernel by answering the following questionnaire. Good luck!

Have you passed the test?

8 Summary

In this kernel we have learned about the Apriori algorithm, one of the most frequently used algorithms in data mining. We have reviewed some statistical concepts (support, confidence, lift and conviction) to select interesting rules, we have chosen the appropriate values to execute the algorithm and finally we have visualized the resulting association rules.

And that’s it! It has been a pleasure to make this kernel, I have learned a lot! Thank you for reading and if you like it, please upvote it!


9 Citations for used packages

Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse

Michael Hahsler, Christian Buchta, Bettina Gruen and Kurt Hornik (2018). arules: Mining Association Rules and Frequent Itemsets. R package version 1.6-1. https://CRAN.R-project.org/package=arules

Michael Hahsler, Bettina Gruen and Kurt Hornik (2005), arules - A Computational Environment for Mining Association Rules and Frequent Item Sets. Journal of Statistical Software 14/15. URL: http://dx.doi.org/10.18637/jss.v014.i15.

Michael Hahsler, Sudheer Chelluboina, Kurt Hornik, and Christian Buchta (2011), The arules R-package ecosystem: Analyzing interesting patterns from large transaction datasets. Journal of Machine Learning Research, 12:1977–1981. URL: http://jmlr.csail.mit.edu/papers/v12/hahsler11a.html.

Michael Hahsler (2018). arulesViz: Visualizing Association Rules and Frequent Itemsets. R package version 1.3-1. https://CRAN.R-project.org/package=arulesViz

Yihui Xie (2018). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.20.

Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595

Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.3. https://CRAN.R-project.org/package=gridExtra