Association Rule Mining (ARM) is a data mining technique that finds associations between items that frequently occur together in transactions, so that the resulting rules can be used in further applications.

We will use the arules package for ARM and the arulesViz package for visualizing association rules.
The data set will be the built-in Groceries data.

setwd("D:/Class Materials & Work/Summer 2020 practice/ARM")

library(arules) #for ARM
library(arulesViz) #for ARM visualization

Load the data set and inspect it.

data(Groceries)

class(Groceries)
## [1] "transactions"
## attr(,"package")
## [1] "arules"
Groceries
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)

The data is a transactional data set with 9835 transactions (rows) and 169 items (columns).
We can look at the data at the itemset level with arules::inspect. Specify the number of rows to avoid flooding your output, and indicate whether you want to inspect from the top (head) or the bottom (tail).

inspect(head(Groceries, 2))
##     items                
## [1] {citrus fruit,       
##      semi-finished bread,
##      margarine,          
##      ready soups}        
## [2] {tropical fruit,     
##      yogurt,             
##      coffee}

The output above shows the first two transactions.
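As a small complement, the sketch below looks at the other end of the data with tail() and counts the items in each transaction with size(), which arules provides for transactions objects. The variable name basket_sizes is only illustrative and does not appear elsewhere in this walkthrough.

```r
library(arules)
data(Groceries)

# Inspect the last two transactions instead of the first two
inspect(tail(Groceries, 2))

# Number of items in each transaction (one value per row)
basket_sizes <- size(Groceries)
head(basket_sizes, 2)
summary(basket_sizes)
```

The first two sizes should match the item counts of the two transactions printed by inspect(head(Groceries, 2)).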

How to see the most frequent items?

The eclat() function takes a transactions object and returns the frequent itemsets whose support is at least the value you pass as supp. The maxlen parameter sets the maximum number of items in each frequent itemset.

frequentItems <- eclat (Groceries, parameter = list(supp = 0.07, maxlen = 15)) # calculates support for frequent items
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.07      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 688 
## 
## create itemset ... 
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating sparse bit matrix ... [18 row(s), 9835 column(s)] done [0.00s].
## writing  ... [19 set(s)] done [0.01s].
## Creating S4 object  ... done [0.00s].
inspect(frequentItems)
##      items                         support    transIdenticalToItemsets count
## [1]  {other vegetables,whole milk} 0.07483477  736                      736 
## [2]  {whole milk}                  0.25551601 2513                     2513 
## [3]  {other vegetables}            0.19349263 1903                     1903 
## [4]  {rolls/buns}                  0.18393493 1809                     1809 
## [5]  {yogurt}                      0.13950178 1372                     1372 
## [6]  {soda}                        0.17437722 1715                     1715 
## [7]  {root vegetables}             0.10899847 1072                     1072 
## [8]  {tropical fruit}              0.10493137 1032                     1032 
## [9]  {bottled water}               0.11052364 1087                     1087 
## [10] {sausage}                     0.09395018  924                      924 
## [11] {shopping bags}               0.09852567  969                      969 
## [12] {citrus fruit}                0.08276563  814                      814 
## [13] {pastry}                      0.08896797  875                      875 
## [14] {pip fruit}                   0.07564820  744                      744 
## [15] {whipped/sour cream}          0.07168277  705                      705 
## [16] {fruit/vegetable juice}       0.07229283  711                      711 
## [17] {newspapers}                  0.07981698  785                      785 
## [18] {bottled beer}                0.08052872  792                      792 
## [19] {canned beer}                 0.07768175  764                      764
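To make the support column concrete, here is a minimal base R sketch that computes support by hand on a made-up set of four baskets. The baskets and the helper itemset_support() are hypothetical and only illustrate the definition (share of transactions containing every item of the itemset); this is not how eclat() is implemented internally.

```r
# Toy transactions, invented for illustration only
baskets <- list(
  c("milk", "bread"),
  c("milk", "yogurt", "bread"),
  c("bread", "butter"),
  c("milk", "yogurt")
)

# Support = fraction of baskets that contain all items of the itemset
itemset_support <- function(itemset, transactions) {
  contains_all <- vapply(
    transactions,
    function(basket) all(itemset %in% basket),
    logical(1)
  )
  mean(contains_all)
}

itemset_support("milk", baskets)              # 3 of 4 baskets
itemset_support(c("milk", "bread"), baskets)  # 2 of 4 baskets
```

With a minimum support of 0.6, only {milk} and {bread} would count as frequent in this toy data, since each appears in three of the four baskets.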

We can plot item frequency with itemFrequencyPlot.

itemFrequencyPlot(Groceries, topN=10, type="absolute", main="Item Frequency")
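If you want the numbers behind the plot, a possible approach is to call itemFrequency() from arules and sort the result. The names item_counts and top_items are illustrative.

```r
library(arules)
data(Groceries)

# Absolute counts for every item, as a named numeric vector
item_counts <- itemFrequency(Groceries, type = "absolute")

# Ten most frequent items, the same ones the plot shows
top_items <- head(sort(item_counts, decreasing = TRUE), 10)
top_items
```

The counts for the leading items should agree with the count column of the eclat() output, for example 2513 for whole milk.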

Generating Rules

We will use minimum support and confidence thresholds to mine the rules, and lift to evaluate how interesting each rule is.

Let's find the rules using the apriori algorithm.

grocery_rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(grocery_rules)
## set of 15 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 15 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01007   Min.   :0.5000   Min.   :0.01729   Min.   :1.984  
##  1st Qu.:0.01174   1st Qu.:0.5151   1st Qu.:0.02089   1st Qu.:2.036  
##  Median :0.01230   Median :0.5245   Median :0.02430   Median :2.203  
##  Mean   :0.01316   Mean   :0.5411   Mean   :0.02454   Mean   :2.299  
##  3rd Qu.:0.01403   3rd Qu.:0.5718   3rd Qu.:0.02598   3rd Qu.:2.432  
##  Max.   :0.02227   Max.   :0.5862   Max.   :0.04342   Max.   :3.030  
##      count      
##  Min.   : 99.0  
##  1st Qu.:115.5  
##  Median :121.0  
##  Mean   :129.4  
##  3rd Qu.:138.0  
##  Max.   :219.0  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835    0.01        0.5

The Apriori algorithm generated 15 rules with the given constraints (parameters). The Parameter specification section of the output echoes those constraints: confidence 0.5, support 0.01, minlen 1, and maxlen 10.

We can inspect the top three rules sorted by confidence.

inspect(head(sort(grocery_rules, by = "confidence"), 3))
##     lhs                                 rhs                support   
## [1] {citrus fruit,root vegetables}   => {other vegetables} 0.01037112
## [2] {tropical fruit,root vegetables} => {other vegetables} 0.01230300
## [3] {curd,yogurt}                    => {whole milk}       0.01006609
##     confidence coverage   lift     count
## [1] 0.5862069  0.01769192 3.029608 102  
## [2] 0.5845411  0.02104728 3.020999 121  
## [3] 0.5823529  0.01728521 2.279125  99
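The quality measures of the first rule can be reproduced by hand from numbers already printed in this walkthrough. The sketch below uses only base R; the variable names are illustrative, and the formulas are the usual definitions (support = count / N, confidence = support / coverage, lift = confidence / support of the rhs).

```r
# Values copied from the output shown earlier
n_transactions <- 9835        # total transactions in Groceries
count_rule     <- 102         # transactions containing lhs and rhs together
coverage       <- 0.01769192  # support of the lhs {citrus fruit,root vegetables}
count_rhs      <- 1903        # transactions containing {other vegetables}

support    <- count_rule / n_transactions
confidence <- support / coverage
lift       <- confidence / (count_rhs / n_transactions)

round(c(support = support, confidence = confidence, lift = lift), 4)
```

A lift of about 3 means that other vegetables appear roughly three times as often in baskets containing citrus fruit and root vegetables as they do in baskets overall.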

Visualizing Association Rules

Package arulesViz supports visualization of association rules with scatter plot, balloon plot, graph, parallel coordinates plot, etc.

# Scatter plot of the rules (by default: support vs. confidence, shaded by lift)
plot(grocery_rules)

#Graph plot for items
plot(grocery_rules, method="graph", control=list(verbose = FALSE))

#Parallel coordinate plot
plot(grocery_rules, method="paracoord", control=list(reorder=TRUE))

Limiting the number of rules generated (Rule pruning)

We can limit the number of generated rules to keep only the significant rules for further use.

wholemilk_rules <- apriori(data=Groceries, parameter=list (supp=0.001,conf = 0.08), 
                           appearance = list (rhs="whole milk"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.08    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [3765 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(wholemilk_rules)
## set of 3765 rules
## 
## rule length distribution (lhs + rhs):sizes
##    1    2    3    4    5    6 
##    1  134 1503 1792  325   10 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    3.62    4.00    6.00 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.1071   Min.   :0.001017   Min.   :0.4193  
##  1st Qu.:0.001118   1st Qu.:0.4783   1st Qu.:0.001932   1st Qu.:1.8717  
##  Median :0.001423   Median :0.5702   Median :0.002644   Median :2.2317  
##  Mean   :0.002348   Mean   :0.5749   Mean   :0.004952   Mean   :2.2500  
##  3rd Qu.:0.002237   3rd Qu.:0.6667   3rd Qu.:0.004372   3rd Qu.:2.6091  
##  Max.   :0.255516   Max.   :1.0000   Max.   :1.000000   Max.   :3.9136  
##      count        
##  Min.   :  10.00  
##  1st Qu.:  11.00  
##  Median :  14.00  
##  Mean   :  23.09  
##  3rd Qu.:  22.00  
##  Max.   :2513.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001       0.08

The above code restricts the right-hand side to “whole milk”, so it generates only rules that lead to buying “whole milk” and shows which products are bought together with it. The rules describe co-occurrence within a transaction, not the order of purchase.

There are more than 3,000 rules (3765 here), which is too many to review by hand. You can limit the number of rules by adjusting a few parameters, depending on the data. The most common options are raising support or confidence and setting other parameters such as minlen and maxlen.
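Besides re-mining with stricter thresholds, an already mined rule set can be filtered on its quality measures. The sketch below assumes the subset() method that arules provides for rules and uses an arbitrary cutoff of lift > 3; the name high_lift_rules is illustrative.

```r
library(arules)
data(Groceries)

# Same settings as the 15-rule example, with progress output switched off
grocery_rules <- apriori(Groceries,
                         parameter = list(support = 0.01, confidence = 0.5),
                         control = list(verbose = FALSE))

# Keep only rules whose lift exceeds the (arbitrary) cutoff
high_lift_rules <- subset(grocery_rules, subset = lift > 3)
inspect(high_lift_rules)
```

Filtering after mining is cheap, so it is convenient for trying several cutoffs without running apriori() again.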

grocery_rules_increased_support <- apriori(Groceries, parameter = list(support = 0.02, confidence = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.02      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 196 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(grocery_rules_increased_support)
## set of 1 rules
## 
## rule length distribution (lhs + rhs):sizes
## 3 
## 1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.02227   Min.   :0.5129   Min.   :0.04342   Min.   :2.007  
##  1st Qu.:0.02227   1st Qu.:0.5129   1st Qu.:0.04342   1st Qu.:2.007  
##  Median :0.02227   Median :0.5129   Median :0.04342   Median :2.007  
##  Mean   :0.02227   Mean   :0.5129   Mean   :0.04342   Mean   :2.007  
##  3rd Qu.:0.02227   3rd Qu.:0.5129   3rd Qu.:0.04342   3rd Qu.:2.007  
##  Max.   :0.02227   Max.   :0.5129   Max.   :0.04342   Max.   :2.007  
##      count    
##  Min.   :219  
##  1st Qu.:219  
##  Median :219  
##  Mean   :219  
##  3rd Qu.:219  
##  Max.   :219  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835    0.02        0.5

We increased support from 0.01 to 0.02, and only one rule remains.
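To see how sensitive the rule count is to the support threshold, one option is to loop over a few thresholds and record the number of rules. In this sketch, 0.02 and 0.01 are the values used in this walkthrough, while 0.005 is an extra level added purely for illustration.

```r
library(arules)
data(Groceries)

support_levels <- c(0.02, 0.01, 0.005)  # 0.005 is an illustrative addition

# Mine rules at each support level and count them
rule_counts <- sapply(support_levels, function(s) {
  rules <- apriori(Groceries,
                   parameter = list(support = s, confidence = 0.5),
                   control = list(verbose = FALSE))
  length(rules)
})
names(rule_counts) <- support_levels
rule_counts
```

Lowering the threshold can only add rules, never remove them, so the counts grow as support decreases.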

If you want stronger rules, increase the confidence. If you want longer rules, increase the maxlen parameter. If you want to eliminate shorter rules, increase the minlen parameter.
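As an example of the minlen parameter, the sketch below re-mines the whole milk rules with minlen = 2, which drops the single-item rule that has an empty left-hand side. The appearance argument is written with an explicit default = "lhs", which is an assumption about the intended setup rather than the exact call used above; the name rules_minlen2 is illustrative.

```r
library(arules)
data(Groceries)

# Require at least two items per rule (lhs + rhs)
rules_minlen2 <- apriori(Groceries,
                         parameter = list(supp = 0.001, conf = 0.08, minlen = 2),
                         appearance = list(default = "lhs", rhs = "whole milk"),
                         control = list(verbose = FALSE))

# Smallest rule length that remains
min(size(rules_minlen2))
```

Setting maxlen in the same parameter list works the same way and caps the rule length from above.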

Sometimes you may want to keep the rules involving the largest number of items and remove the shorter rules that are subsets of longer ones, since these are considered redundant.

subsets <- which(colSums(is.subset(wholemilk_rules, grocery_rules)) > 1) # indices of grocery_rules that contain more than one whole-milk rule as a subset

length(subsets)
## [1] 11
grocery_rules_prunned <- grocery_rules[-subsets]

grocery_rules_prunned
## set of 4 rules
plot(grocery_rules_prunned, method="paracoord", control=list(reorder=TRUE))

We can see that 11 of the 15 rules are removed and 4 remain. The pruning can be adjusted based on the nature of the data. The lower the support threshold is (0.0001, for instance), the more rules will be generated.
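A different way to prune, shown here only as a sketch, is the is.redundant() function from arules. It flags a rule when a more general rule with the same right-hand side and at least the same confidence exists in the set. This is a swapped-in technique, not the subset comparison used above, and the names redundant and wholemilk_rules_nonredundant are illustrative.

```r
library(arules)
data(Groceries)

# Re-mine the whole milk rules with progress output switched off
wholemilk_rules <- apriori(Groceries,
                           parameter = list(supp = 0.001, conf = 0.08),
                           appearance = list(default = "lhs", rhs = "whole milk"),
                           control = list(verbose = FALSE))

# Logical vector with one entry per rule: TRUE means redundant
redundant <- is.redundant(wholemilk_rules)

# Keep only the non-redundant rules
wholemilk_rules_nonredundant <- wholemilk_rules[!redundant]
length(wholemilk_rules_nonredundant)
```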