Association Rules

Introduction

Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. The arules package in R provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules.

The Apriori algorithm finds all rules (the default association type for apriori()) that meet a minimum support and a minimum confidence; the analysis below uses a support of 0.009 (0.9%) and a confidence of 0.25. A typical application of the Apriori algorithm is market basket analysis, which provides insights into which products tend to be purchased together and which are most amenable to promotion. The drawback of the Apriori algorithm is that it can be slow and can generate a large number of rules that are difficult to interpret, although visualization and filtering techniques can help.
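To make support, confidence and lift concrete, here is a minimal base R sketch on five made-up baskets. The baskets and the helper name support() are hypothetical; they are not part of arules or of the Groceries data, and the sketch only illustrates the three measures, not the Apriori search itself.

```r
# Five hypothetical baskets (illustrative only, not taken from Groceries)
baskets <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("milk")
)

# Hypothetical helper: share of baskets that contain every item in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Rule {bread} => {milk}
supp_rule <- support(c("bread", "milk"))   # baskets containing both items
conf_rule <- supp_rule / support("bread")  # share of bread baskets that also hold milk
lift_rule <- conf_rule / support("milk")   # confidence relative to milk's base rate

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

In this toy data the lift comes out below 1, which means bread buyers pick milk slightly less often than the average basket does.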

Data

The data used here is ‘Groceries’. The Groceries data set contains one month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions, and the items are aggregated into 169 categories. The data set is provided for arules by Michael Hahsler, Kurt Hornik and Thomas Reutterer.

Libraries

Before proceeding, load the libraries that will be used for the analysis:

library(arules)
library(arulesViz)
library(tidyverse)

Importing the data

Load the data and print a summary of it:

data("Groceries")
summary(Groceries)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

A couple of insights can be derived from the summary:

A density of 0.026 means that 2.6% of the matrix cells are non-zero.
Dimensions: 9835 transactions × 169 items
Whole milk was purchased 2513 times, which is about 26% of the transactions.
2159 transactions contained a single item, while only 1 transaction had 32 items.
The first quartile and the median are 2 and 3, which means that 25% of the transactions had 2 items or fewer and about half had 3 items or fewer.
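These figures can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from the summary output; the variable names are illustrative.

```r
n_transactions <- 9835
n_items <- 169

# Item counts from the "most frequent items" block, including (Other)
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)
total_items <- sum(item_counts)           # all non-zero cells of the matrix

total_items / (n_transactions * n_items)  # density, about 0.026
total_items / n_transactions              # mean basket size, about 4.409
2513 / n_transactions                     # share of baskets with whole milk, about 0.256
2159 / n_transactions                     # share of single-item baskets, about 0.22
```

The density and the mean basket size recovered this way agree with the values printed by summary(Groceries).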

Let’s see what Grocery data has in store for us:

grocery <- as(Groceries, 'data.frame')
head(grocery)

##                                                                   items
## 1              {citrus fruit,semi-finished bread,margarine,ready soups}
## 2                                        {tropical fruit,yogurt,coffee}
## 3                                                          {whole milk}
## 4                         {pip fruit,yogurt,cream cheese ,meat spreads}
## 5 {other vegetables,whole milk,condensed milk,long life bakery product}
## 6                      {whole milk,butter,yogurt,rice,abrasive cleaner}

Frequency of the first five items:

itemFrequency(Groceries[,1:5])

## frankfurter     sausage  liver loaf         ham        meat 
## 0.058973055 0.093950178 0.005083884 0.026029487 0.025826131

Items with a support of at least 10%

itemFrequencyPlot(Groceries, support= 0.10)

Items with a support of at least 5%

itemFrequencyPlot(Groceries, support= 0.05)

Relative frequency of top 20 items

itemFrequencyPlot(Groceries, topN = 20)

Visualizing first 5 transactions

image(Groceries[1:5])

100 random transactions

image(sample(Groceries, 100))

Apriori algorithm

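Suppose we are interested in items that are sold at least three times a day. Over the 30-day month that is about 90 transactions, so the minimum support is 90/9835 ≈ 0.009. A minimum confidence of 0.25 and a minimum rule length of 2 are set as well: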
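The threshold arithmetic can be written out as a short sketch (variable names are illustrative):

```r
n_transactions <- 9835
days <- 30
sales_per_day <- 3

min_count <- sales_per_day * days          # 90 baskets over the month
min_support <- min_count / n_transactions  # about 0.00915
round(min_support, 3)                      # rounded to the value passed to apriori()

# A support of exactly 0.009 corresponds to slightly fewer than 90 baskets
0.009 * n_transactions                     # 88.515
```

Because the support is rounded down to 0.009, the effective cut-off is a little below 90 baskets, which is consistent with the absolute minimum support count of 88 reported by apriori().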

basket <- apriori(Groceries, parameter = list(support = 0.009, confidence = 0.25, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.009      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 88 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [93 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [224 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

basket

## set of 224 rules

summary(basket)

## set of 224 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3 
## 111 113 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.504   3.000   3.000 
## 
## summary of quality measures:
##     support           confidence          lift            count      
##  Min.   :0.009049   Min.   :0.2513   Min.   :0.9932   Min.   : 89.0  
##  1st Qu.:0.010066   1st Qu.:0.2974   1st Qu.:1.5767   1st Qu.: 99.0  
##  Median :0.012303   Median :0.3603   Median :1.8592   Median :121.0  
##  Mean   :0.016111   Mean   :0.3730   Mean   :1.9402   Mean   :158.5  
##  3rd Qu.:0.018480   3rd Qu.:0.4349   3rd Qu.:2.2038   3rd Qu.:181.8  
##  Max.   :0.074835   Max.   :0.6389   Max.   :3.7969   Max.   :736.0  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.009       0.25

A couple of insights from the summary:

The algorithm produced a set of 224 rules. The rule length distribution shows how many rules contain a given number of items (lhs + rhs):
111 rules contain 2 items
113 rules contain 3 items

inspect(basket[1:10])

##      lhs                rhs                support     confidence lift    
## [1]  {baking powder} => {whole milk}       0.009252669 0.5229885  2.046793
## [2]  {grapes}        => {other vegetables} 0.009049314 0.4045455  2.090754
## [3]  {meat}          => {other vegetables} 0.009964413 0.3858268  1.994013
## [4]  {meat}          => {whole milk}       0.009964413 0.3858268  1.509991
## [5]  {frozen meals}  => {whole milk}       0.009862735 0.3476703  1.360659
## [6]  {hard cheese}   => {other vegetables} 0.009456024 0.3858921  1.994350
## [7]  {hard cheese}   => {whole milk}       0.010066090 0.4107884  1.607682
## [8]  {butter milk}   => {other vegetables} 0.010371124 0.3709091  1.916916
## [9]  {butter milk}   => {whole milk}       0.011591256 0.4145455  1.622385
## [10] {ham}           => {other vegetables} 0.009150991 0.3515625  1.816930
##      count
## [1]   91  
## [2]   89  
## [3]   98  
## [4]   98  
## [5]   97  
## [6]   93  
## [7]   99  
## [8]  102  
## [9]  114  
## [10]  90

The first ten rules are shown here, each with its support, confidence, lift and count. The lift of a rule measures how much more likely an item or itemset is to be purchased, relative to its typical rate of purchase, given that another item or itemset has been purchased. A lift above 1 indicates a positive association.
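As a quick check of this definition, lift can be recomputed as the rule's confidence divided by the support of its right-hand side. The sketch below does this for rule [1], using the confidence from the inspect() output and the whole milk count from summary(Groceries); the variable names are illustrative.

```r
n_transactions <- 9835

# Rule [1]: {baking powder} => {whole milk}
confidence <- 0.5229885                    # from the inspect() output
supp_whole_milk <- 2513 / n_transactions   # from summary(Groceries)

lift <- confidence / supp_whole_milk
lift                                       # about 2.05, in line with the lift column
```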

Sorting according to lift

inspect(sort(basket, by = "lift")[1:5])

##     lhs                   rhs                      support confidence     lift count
## [1] {berries}          => {whipped/sour cream} 0.009049314  0.2721713 3.796886    89
## [2] {tropical fruit,                                                                
##      other vegetables} => {pip fruit}          0.009456024  0.2634561 3.482649    93
## [3] {pip fruit,                                                                     
##      other vegetables} => {tropical fruit}     0.009456024  0.3618677 3.448613    93
## [4] {citrus fruit,                                                                  
##      other vegetables} => {root vegetables}    0.010371124  0.3591549 3.295045   102
## [5] {tropical fruit,                                                                
##      other vegetables} => {root vegetables}    0.012302999  0.3427762 3.144780   121

People who buy berries are about 3.8 times as likely to buy whipped/sour cream as the average customer.

Taking subsets of Arules

Sometimes the marketing team needs to promote a specific product. Say they want to promote berries and want to find out how often, and with which items, berries are purchased. The subset() function finds subsets of transactions, items or rules. The %in% operator matches item labels exactly.

Suppose I want to see it for berries:

berries <- subset(basket, items %in% "berries")
inspect(berries)

##     lhs          rhs                  support     confidence lift    
## [1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  3.796886
## [2] {berries} => {yogurt}             0.010574479 0.3180428  2.279848
## [3] {berries} => {other vegetables}   0.010269446 0.3088685  1.596280
## [4] {berries} => {whole milk}         0.011794611 0.3547401  1.388328
##     count
## [1]  89  
## [2] 104  
## [3] 101  
## [4] 116

Berries appear in four rules. Whipped/sour cream and yogurt have the highest lift, so they are the items most strongly associated with berries.
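subset() accepts other conditions too. The following is a sketch, assuming the arules package is installed and the rules are mined with the same parameters as above; it restricts the match to the left-hand side, uses %pin% for partial matching of item labels, and combines an item condition with a quality measure. The object names are illustrative.

```r
library(arules)
data("Groceries")

# Same mining parameters as above, with progress output switched off
basket <- apriori(Groceries,
                  parameter = list(support = 0.009, confidence = 0.25, minlen = 2),
                  control = list(verbose = FALSE))

# Rules with berries on the left-hand side only
berries_lhs <- subset(basket, lhs %in% "berries")

# Partial matching: rules containing any item whose label includes "fruit"
fruit_rules <- subset(basket, items %pin% "fruit")

# Combine an item condition with a quality measure
strong_berries <- subset(basket, items %in% "berries" & lift > 2)

inspect(strong_berries)
```

Filtering on lift in the same call keeps only the stronger associations, which is often what a promotion-oriented question needs.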

Scatter Plot for 224 rules

plot(basket)

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(basket, measure=c("support", "lift"), shading="confidence")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Shading by order (number of items contained in the rule)

plot(basket, shading="order", control=list(main = "Two-key plot"))

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Interactive Scatter Plot

plot(basket, measure=c("support", "lift"), shading="confidence", engine="htmlwidget")

Group-based visualization

plot(basket, method="grouped")

Graph-based visualization

plot(basket, method="graph")

By default the graph method draws at most 100 rules (max = 100) with the igraph engine. Note that type is not among the control parameters of this arulesViz version, so passing control=list(type="items") only makes plot() print the available control parameters and their defaults: main, nodeColors, nodeCol, edgeCol, alpha (0.5), cex (1), itemLabels (TRUE), labelCol, measureLabels (FALSE), precision (3), layout, layoutParams, arrowSize (0.5), engine (igraph), plot (TRUE), plot_options, max (100) and verbose (FALSE).