# Load libraries
library(tidyverse)
library(arules)
library(arulesViz)
library(knitr)
library(gridExtra)
library(lubridate)
setwd("C:\\Users\\hi\\Documents")
# Read the data
trans <- read.transactions("BreadBasket_DMS.csv", format="single", cols=c(3,4), sep=",", rm.duplicates=TRUE)

Let’s get an idea of what we’re working with.

0.1 Transaction object
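
Simply printing the object shows its dimensions:

trans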

## transactions in sparse format with
##  6614 transactions (rows) and
##  104 items (columns)

0.2 Summary
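
summary(trans) reports the density of the item matrix, the most frequent items and the transaction length distribution:

summary(trans)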

## transactions as itemMatrix in sparse format with
##  6614 rows (elements/itemsets/transactions) and
##  104 columns (items) and a density of 0.02008705 
## 
## most frequent items:
##  Coffee   Bread     Tea    Cake  Pastry (Other) 
##    3188    2146     941     694     576    6272 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10 
## 2556 2154 1078  546  187   67   18    3    2    3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.089   3.000  10.000 
## 
## includes extended item information - examples:
##                     labels
## 1               Adjustment
## 2 Afternoon with the baker
## 3                Alfajores
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2            10
## 3          1000

0.3 Glimpse
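
This is presumably tibble's glimpse(), whose default method falls back to a compact str()-style display for S4 objects:

glimpse(trans)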

## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   ..@ itemInfo   :'data.frame':  104 obs. of  1 variable:
##   .. ..$ labels: chr [1:104] "Adjustment" "Afternoon with the baker" "Alfajores" "Argentina Night" ...
##   ..@ itemsetInfo:'data.frame':  6614 obs. of  1 variable:
##   .. ..$ transactionID: Factor w/ 6614 levels "1","10","1000",..: 1 2 3 4 5 6 7 8 9 10 ...

0.4 Structure
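
str(trans) exposes the full slot structure, including the underlying sparse ngCMatrix:

str(trans)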

## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:13817] 11 63 80 19 80 11 93 14 45 11 ...
##   .. .. ..@ p       : int [1:6615] 0 1 3 5 7 9 11 15 16 17 ...
##   .. .. ..@ Dim     : int [1:2] 104 6614
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  104 obs. of  1 variable:
##   .. ..$ labels: chr [1:104] "Adjustment" "Afternoon with the baker" "Alfajores" "Argentina Night" ...
##   ..@ itemsetInfo:'data.frame':  6614 obs. of  1 variable:
##   .. ..$ transactionID: Factor w/ 6614 levels "1","10","1000",..: 1 2 3 4 5 6 7 8 9 10 ...

1 Data Dictionary

The data set contains 15,010 observations and the following columns,

  • Date: date of the transaction

  • Time: time of the transaction

  • Transaction: transaction identifier (column 3 in the code above)

  • Item: product purchased (column 4 in the code above)

2 Data Analysis

Before applying the Apriori algorithm to the data set, we are going to show some visualizations to learn more about the transactions. For example, we can use itemFrequencyPlot() to create an item frequency bar plot and view the distribution of products.

itemFrequencyPlot() can show absolute or relative values. With type = "absolute" it plots the raw purchase count of each item independently; with type = "relative" it plots the proportion of transactions in which each item appears, as shown in the following plot.
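
A minimal call that produces such a plot (topN = 15 is an arbitrary choice here, not necessarily the value used for the figure):

itemFrequencyPlot(trans, topN = 15, type = "relative",
                  main = "Relative Item Frequency Plot")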

Coffee is the best-selling product by far, followed by bread and tea. Let's display some other visualizations describing how the transactions are distributed over time, using the ggplot() function.
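
The transactions object no longer carries the dates, so one way to build these plots is to go back to the raw file. This is a sketch assuming its columns are named Date, Time, Transaction and Item, with dates stored as YYYY-MM-DD:

# Re-read the raw file to recover the dates
raw <- read.csv("BreadBasket_DMS.csv")
raw$Date <- as.Date(raw$Date)

# Distinct transactions per month
raw %>%
  mutate(Month = floor_date(Date, "month")) %>%
  distinct(Month, Transaction) %>%
  count(Month) %>%
  ggplot(aes(Month, n)) +
  geom_col() +
  labs(x = "Month", y = "Transactions")

# Distinct transactions per day of the week
raw %>%
  mutate(Weekday = wday(Date, label = TRUE)) %>%
  distinct(Weekday, Transaction) %>%
  count(Weekday) %>%
  ggplot(aes(Weekday, n)) +
  geom_col() +
  labs(x = "Weekday", y = "Transactions")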

The data set includes dates from 30/10/2016 to 09/04/2017, which is why we have so few transactions in October and April.

As we can see, Saturday is the busiest day in the bakery. Conversely, Wednesday is the day with the fewest transactions.

There’s not much to discuss with this visualization. The results are logical and expected.

3 Apriori algorithm

3.1 Choice of support and confidence

The first step in creating a set of association rules is to determine the optimal thresholds for support and confidence. If we set these values too low, the algorithm will take longer to execute and we will get a lot of rules (most of them not useful). So which values do we choose? We can try different combinations of support and confidence and see graphically how many rules are generated for each one.

In the following graphs we can see the number of rules generated with a support level of 10%, 5%, 1% and 0.5%.
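
A sketch of how this scan can be computed; the exact grid of thresholds is an assumption:

# Candidate thresholds
supportLevels <- c(0.1, 0.05, 0.01, 0.005)
confidenceLevels <- seq(0.05, 0.50, by = 0.05)

# Number of rules mined for every support/confidence combination
# (rows = confidence levels, columns = support levels)
ruleCounts <- sapply(supportLevels, function(s) {
  sapply(confidenceLevels, function(cf) {
    length(apriori(trans,
                   parameter = list(supp = s, conf = cf, target = "rules"),
                   control = list(verbose = FALSE)))
  })
})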

We can join the four lines to improve the visualization.
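
For instance, reshaping the counts into long format lets ggplot() draw one line per support level:

tibble(confidence = rep(confidenceLevels, times = length(supportLevels)),
       support    = factor(rep(supportLevels, each = length(confidenceLevels))),
       rules      = as.vector(ruleCounts)) %>%
  ggplot(aes(confidence, rules, colour = support)) +
  geom_line() +
  geom_point() +
  labs(x = "Confidence level", y = "Number of rules", colour = "Support level")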

Let’s analyze the results,

  • Support level of 10%. We only identify a few rules, all with very low confidence. This means there are no relatively frequent associations in our data set. We can't choose this value; the resulting rules would be unrepresentative.

  • Support level of 5%. We identify only one rule with a confidence of at least 50%. It seems we have to look at support levels below 5% to obtain a larger number of rules with reasonable confidence.

  • Support level of 1%. We start to get dozens of rules, 13 of which have a confidence of at least 50%.

  • Support level of 0.5%. Too many rules to analyze!

To sum up, we are going to use a support level of 1% and a confidence level of 50%.

3.2 Execution

Let’s execute the Apriori algorithm with the values obtained in the previous section.
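
A sketch of the call, with inspect() printing the resulting rule table:

# Mine rules with the thresholds chosen above
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
inspect(rules)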

The generated association rules are the following,

##      lhs                 rhs      support    confidence lift     count
## [1]  {Tiffin}         => {Coffee} 0.01058361 0.5468750  1.134577  70  
## [2]  {Spanish Brunch} => {Coffee} 0.01406108 0.6326531  1.312537  93  
## [3]  {Scone}          => {Coffee} 0.01844572 0.5422222  1.124924 122  
## [4]  {Toast}          => {Coffee} 0.02570305 0.7296137  1.513697 170  
## [5]  {Alfajores}      => {Coffee} 0.02237678 0.5522388  1.145705 148  
## [6]  {Juice}          => {Coffee} 0.02131842 0.5300752  1.099723 141  
## [7]  {Hot chocolate}  => {Coffee} 0.02721500 0.5263158  1.091924 180  
## [8]  {Medialuna}      => {Coffee} 0.03296039 0.5751979  1.193337 218  
## [9]  {Cookies}        => {Coffee} 0.02978530 0.5267380  1.092800 197  
## [10] {NONE}           => {Coffee} 0.04172966 0.5810526  1.205484 276  
## [11] {Sandwich}       => {Coffee} 0.04233444 0.5679513  1.178303 280  
## [12] {Pastry}         => {Coffee} 0.04868461 0.5590278  1.159790 322  
## [13] {Cake}           => {Coffee} 0.05654672 0.5389049  1.118042 374

We can also create an HTML table widget using the inspectDT() function from the arulesViz package. Rules can be interactively filtered and sorted.
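
For example:

inspectDT(rules)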

How do we interpret these rules?

  • 52% of the customers who bought a hot chocolate also bought a coffee.

  • 63% of the customers who bought a Spanish Brunch also bought a coffee.

  • 73% of the customers who bought toast also bought a coffee.

And so on. It seems that in this bakery there are many coffee lovers.

3.3 Visualize association rules

We are going to use the arulesViz package to create the visualizations. Let’s begin with a simple scatter plot with different measures of interestingness on the axes (lift and support) and a third measure (confidence) represented by the color of the points.
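
In arulesViz this is a single call:

plot(rules, measure = c("support", "lift"), shading = "confidence")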

The following visualization represents the rules as a graph with items as labeled vertices, and rules represented as vertices connected to items using arrows.
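
This is the "graph" plot method:

plot(rules, method = "graph")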

We can also change the graph layout.
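
For example, with a circular igraph layout (older arulesViz versions accept the layout through the control list; newer ones take it as a direct argument):

plot(rules, method = "graph", control = list(layout = igraph::in_circle()))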

What else can we do? We can represent the rules as a grouped matrix-based visualization. The support and lift measures are represented by the size and color of the balloons, respectively. In this case it's not a very useful visualization, since we only have coffee on the right-hand side of the rules.
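
This is the "grouped" plot method:

plot(rules, method = "grouped")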

There's an awesome function called ruleExplorer() that explores association rules with interactive manipulations and visualizations built on Shiny. Unfortunately, a static R Markdown document like this one can't embed Shiny app objects.
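
In an interactive session it's a one-liner:

ruleExplorer(rules)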

3.4 Another execution

We have executed the Apriori algorithm with the appropriate support and confidence values. What happens if we execute it with low values? How do the visualizations change? Let’s try with a support level of 0.5% and a confidence level of 10%.
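
A sketch of that second run (rules2 is a name introduced here for illustration):

# Mine with deliberately loose thresholds
rules2 <- apriori(trans, parameter = list(supp = 0.005, conf = 0.1))

# The same visualizations as before, now far denser
plot(rules2, measure = c("support", "lift"), shading = "confidence")
plot(rules2, method = "graph")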

It's impossible to analyze these visualizations! As the rule set grows, visual analysis becomes difficult. Furthermore, most of the rules are useless. That's why we have to select the support and confidence values carefully.

4 Summary

In this kernel we have learned about the Apriori algorithm, one of the most frequently used algorithms in data mining. We have reviewed some statistical concepts (support, confidence, lift and conviction) to select interesting rules, we have chosen the appropriate values to execute the algorithm and finally we have visualized the resulting association rules.
