Association Rule Mining practice

Association Rule Mining (ARM) is a data mining technique that looks for associations between itemsets, which can then be used in further applications.

We will use the arules package for ARM and the arulesViz package for visualizing association rules.
The data set is the built-in Groceries data.

setwd("D:/Class Materials & Work/Summer 2020 practice/ARM")

library(arules) #for ARM
library(arulesViz) #for ARM visualization

Load the data set and inspect it.

data(Groceries)

class(Groceries)

## [1] "transactions"
## attr(,"package")
## [1] "arules"

Groceries

## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)

The data is a transactional data set with N = 9835 transactions and 169 distinct items.
We can look at the data at the itemset level by using arules::inspect. Be careful to specify the number of rows to avoid flooding your output, and indicate whether you want to inspect from the top (head) or from the bottom (tail).

inspect(head(Groceries, 2))

##     items                
## [1] {citrus fruit,       
##      semi-finished bread,
##      margarine,          
##      ready soups}        
## [2] {tropical fruit,     
##      yogurt,             
##      coffee}

The output above shows the first two transactions.
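To make the "rows and columns" description of the sparse format concrete, here is a minimal base-R sketch that stores the two transactions shown above as a list and expands them into a logical item matrix, with one row per transaction and one column per item. The names toy_transactions, all_items and item_matrix are illustrative and not part of arules, which keeps the same information in a compressed form.

```r
# The two transactions printed by inspect(head(Groceries, 2)).
toy_transactions <- list(
  c("citrus fruit", "semi-finished bread", "margarine", "ready soups"),
  c("tropical fruit", "yogurt", "coffee")
)

# All distinct items become the columns of the matrix.
all_items <- sort(unique(unlist(toy_transactions)))

# One row per transaction; TRUE marks an item that the transaction contains.
item_matrix <- t(sapply(toy_transactions, function(tr) all_items %in% tr))
colnames(item_matrix) <- all_items

dim(item_matrix)      # transactions by items
rowSums(item_matrix)  # number of items in each transaction
```

Most cells in such a matrix are FALSE, which is why a sparse representation pays off for the full 9835 by 169 data set.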

How to see the most frequent items?

The eclat() function takes a transactions object and returns the itemsets whose support is at least the value you pass to the supp argument. The maxlen argument sets the maximum number of items in each frequent itemset.

frequentItems <- eclat (Groceries, parameter = list(supp = 0.07, maxlen = 15)) # calculates support for frequent items

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.07      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 688 
## 
## create itemset ... 
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating sparse bit matrix ... [18 row(s), 9835 column(s)] done [0.00s].
## writing  ... [19 set(s)] done [0.01s].
## Creating S4 object  ... done [0.00s].

inspect(frequentItems)

##      items                         support    transIdenticalToItemsets count
## [1]  {other vegetables,whole milk} 0.07483477  736                      736 
## [2]  {whole milk}                  0.25551601 2513                     2513 
## [3]  {other vegetables}            0.19349263 1903                     1903 
## [4]  {rolls/buns}                  0.18393493 1809                     1809 
## [5]  {yogurt}                      0.13950178 1372                     1372 
## [6]  {soda}                        0.17437722 1715                     1715 
## [7]  {root vegetables}             0.10899847 1072                     1072 
## [8]  {tropical fruit}              0.10493137 1032                     1032 
## [9]  {bottled water}               0.11052364 1087                     1087 
## [10] {sausage}                     0.09395018  924                      924 
## [11] {shopping bags}               0.09852567  969                      969 
## [12] {citrus fruit}                0.08276563  814                      814 
## [13] {pastry}                      0.08896797  875                      875 
## [14] {pip fruit}                   0.07564820  744                      744 
## [15] {whipped/sour cream}          0.07168277  705                      705 
## [16] {fruit/vegetable juice}       0.07229283  711                      711 
## [17] {newspapers}                  0.07981698  785                      785 
## [18] {bottled beer}                0.08052872  792                      792 
## [19] {canned beer}                 0.07768175  764                      764
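As a quick check on the output above, the support column is the count divided by the number of transactions, and the "Absolute minimum support count" in the log matches the relative threshold scaled by N and truncated. The sketch below uses only numbers copied from the output; the use of floor() is an inference from the printed values, not a statement about the package internals, and the variable names are illustrative.

```r
n_transactions <- 9835   # from the Groceries summary
min_support    <- 0.07   # the supp value passed to eclat()

# Relative threshold expressed as a count; the log prints 688.
min_count <- floor(min_support * n_transactions)

# Support of {other vegetables, whole milk}: 736 of 9835 transactions.
support_pair <- 736 / n_transactions

# Support of {whole milk}: 2513 of 9835 transactions.
support_milk <- 2513 / n_transactions

min_count
round(c(pair = support_pair, milk = support_milk), 8)
```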

We can plot item frequency with itemFrequencyPlot.

itemFrequencyPlot(Groceries, topN=10, type="absolute", main="Item Frequency")

Generating Rules

We will use support and confidence as the thresholds for rule mining, and lift to evaluate how interesting a rule is.

Support indicates how frequently an itemset appears in the data set. For example, among the first two transactions shown earlier, the support of the item citrus fruit is 1/2, because it appears in only one of the two transactions.
Confidence is the proportion of the transactions containing the left-hand side (LHS) of a rule that also contain its right-hand side (RHS).
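The measures can be worked through by hand on a small example. The sketch below uses five hypothetical transactions (not taken from Groceries) and a helper, toy_support(), that is defined here for illustration only. It computes support, confidence and lift for the rule {bread} => {milk}, using the standard definitions: confidence is the rule's support divided by the support of the LHS, and lift is the confidence divided by the support of the RHS.

```r
# Hypothetical toy data, made up for this illustration.
toy <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread"),
  c("milk", "butter"),
  c("milk", "bread", "yogurt")
)

# Share of transactions that contain every item in `items`.
toy_support <- function(items, transactions) {
  mean(sapply(transactions, function(tr) all(items %in% tr)))
}

lhs <- "bread"
rhs <- "milk"

rule_support    <- toy_support(c(lhs, rhs), toy)          # lhs and rhs together
rule_confidence <- rule_support / toy_support(lhs, toy)   # rhs given lhs
rule_lift       <- rule_confidence / toy_support(rhs, toy)

c(support = rule_support, confidence = rule_confidence, lift = rule_lift)
```

Here bread appears in 4 of the 5 transactions and bread with milk in 3, so the confidence is 3/4. Milk alone has support 4/5, which gives a lift of about 0.94.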

Let's find the rules using the Apriori algorithm.

grocery_rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(grocery_rules)

## set of 15 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 15 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01007   Min.   :0.5000   Min.   :0.01729   Min.   :1.984  
##  1st Qu.:0.01174   1st Qu.:0.5151   1st Qu.:0.02089   1st Qu.:2.036  
##  Median :0.01230   Median :0.5245   Median :0.02430   Median :2.203  
##  Mean   :0.01316   Mean   :0.5411   Mean   :0.02454   Mean   :2.299  
##  3rd Qu.:0.01403   3rd Qu.:0.5718   3rd Qu.:0.02598   3rd Qu.:2.432  
##  Max.   :0.02227   Max.   :0.5862   Max.   :0.04342   Max.   :3.030  
##      count      
##  Min.   : 99.0  
##  1st Qu.:115.5  
##  Median :121.0  
##  Mean   :129.4  
##  3rd Qu.:138.0  
##  Max.   :219.0  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835    0.01        0.5

The Apriori algorithm generated 15 rules with the given constraints (parameters). Let's dive into the Parameter specification section of the output.

minval is the minimum value of the additional evaluation measure selected with arem.
smax is the maximum support value for an itemset or rule.
arem selects an additional rule evaluation measure; the value none means that no extra measure is used.
aval is a logical indicating whether to return the additional rule evaluation measure selected with arem.
originalSupport is a logical indicating whether to use the traditional support value, which considers both LHS and RHS items. If you want to use only the LHS items for the calculation, set this to FALSE.
maxtime is the maximum amount of time, in seconds, allowed to check for subsets.
minlen is the minimum number of items required in the rule.
maxlen is the maximum number of items that can be present in the rule.

We can inspect the top three rules sorted by confidence.

inspect(head(sort(grocery_rules, by = "confidence"), 3))

##     lhs                                 rhs                support   
## [1] {citrus fruit,root vegetables}   => {other vegetables} 0.01037112
## [2] {tropical fruit,root vegetables} => {other vegetables} 0.01230300
## [3] {curd,yogurt}                    => {whole milk}       0.01006609
##     confidence coverage   lift     count
## [1] 0.5862069  0.01769192 3.029608 102  
## [2] 0.5845411  0.02104728 3.020999 121  
## [3] 0.5823529  0.01728521 2.279125  99
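The quality columns in this table are related to each other. As a sketch, the code below recomputes confidence and lift for rule [1] from numbers printed in this document, using the standard definitions (coverage is the support of the LHS, confidence is support divided by coverage, and lift is confidence divided by the support of the RHS). The count of 1903 for {other vegetables} comes from the eclat output earlier; the variable names are illustrative.

```r
n_transactions <- 9835

# Rule [1]: {citrus fruit, root vegetables} => {other vegetables}
rule_count   <- 102          # transactions containing lhs and rhs
rule_support <- 0.01037112   # as printed in the table
coverage     <- 0.01769192   # support of the lhs alone
rhs_count    <- 1903         # {other vegetables}, from the eclat output

confidence <- rule_support / coverage
lift       <- confidence / (rhs_count / n_transactions)

round(c(support    = rule_count / n_transactions,
        confidence = confidence,
        lift       = lift), 4)
```

A lift of about 3 means the RHS shows up roughly three times as often among transactions containing the LHS as it does in the data set overall.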

Visualizing Association Rules

Package arulesViz supports visualization of association rules with scatter plot, balloon plot, graph, parallel coordinates plot, etc.

#Scatter plot of the rules by their quality measures
plot(grocery_rules)

#Graph plot for items
plot(grocery_rules, method="graph", control=list(verbose = FALSE))

#Parallel coordinate plot
plot(grocery_rules, method="paracoord", control=list(reorder=TRUE))

Limiting the number of rules generated (Rule pruning)

We can limit the number of generated rules to keep only the significant rules for further use.

wholemilk_rules <- apriori(data=Groceries, parameter=list (supp=0.001,conf = 0.08), 
                           appearance = list (rhs="whole milk"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.08    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [3765 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(wholemilk_rules)

## set of 3765 rules
## 
## rule length distribution (lhs + rhs):sizes
##    1    2    3    4    5    6 
##    1  134 1503 1792  325   10 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    4.00    3.62    4.00    6.00 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.001017   Min.   :0.1071   Min.   :0.001017   Min.   :0.4193  
##  1st Qu.:0.001118   1st Qu.:0.4783   1st Qu.:0.001932   1st Qu.:1.8717  
##  Median :0.001423   Median :0.5702   Median :0.002644   Median :2.2317  
##  Mean   :0.002348   Mean   :0.5749   Mean   :0.004952   Mean   :2.2500  
##  3rd Qu.:0.002237   3rd Qu.:0.6667   3rd Qu.:0.004372   3rd Qu.:2.6091  
##  Max.   :0.255516   Max.   :1.0000   Max.   :1.000000   Max.   :3.9136  
##      count        
##  Min.   :  10.00  
##  1st Qu.:  11.00  
##  Median :  14.00  
##  Mean   :  23.09  
##  3rd Qu.:  22.00  
##  Max.   :2513.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001       0.08

The code above generates rules with “whole milk” on the right-hand side, showing which products tend to lead to buying “whole milk”.

There are over 3,000 rules, which is too many to use at once. You can limit the number of rules by tweaking a few parameters, depending on the type of data. The most common ways are changing support and confidence, and other parameters such as minlen and maxlen.

grocery_rules_increased_support <- apriori(Groceries, parameter = list(support = 0.02, confidence = 0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.02      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 196 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [1 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(grocery_rules_increased_support)

## set of 1 rules
## 
## rule length distribution (lhs + rhs):sizes
## 3 
## 1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.02227   Min.   :0.5129   Min.   :0.04342   Min.   :2.007  
##  1st Qu.:0.02227   1st Qu.:0.5129   1st Qu.:0.04342   1st Qu.:2.007  
##  Median :0.02227   Median :0.5129   Median :0.04342   Median :2.007  
##  Mean   :0.02227   Mean   :0.5129   Mean   :0.04342   Mean   :2.007  
##  3rd Qu.:0.02227   3rd Qu.:0.5129   3rd Qu.:0.04342   3rd Qu.:2.007  
##  Max.   :0.02227   Max.   :0.5129   Max.   :0.04342   Max.   :2.007  
##      count    
##  Min.   :219  
##  1st Qu.:219  
##  Median :219  
##  Mean   :219  
##  3rd Qu.:219  
##  Max.   :219  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835    0.02        0.5

We raised the support threshold from 0.01 to 0.02, which leaves only one rule.

If you want to get stronger rules, you have to increase the confidence. If you want lengthier rules, increase the maxlen parameter. If you want to eliminate shorter rules, increase the minlen parameter.
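Raising minlen is what removes short rules. As a sketch, the code below mines the whole-milk rules twice, once with minlen = 1 and once with minlen = 2, and tabulates the rule lengths with size(). The helper mine_wholemilk() is a hypothetical wrapper written for this illustration; the apriori() arguments mirror the calls used elsewhere in this document, and the exact counts may vary with the arules version.

```r
library(arules)
data(Groceries)

# Hypothetical helper: same whole-milk mining call, with a chosen minimum length.
mine_wholemilk <- function(minlen) {
  apriori(Groceries,
          parameter  = list(supp = 0.001, conf = 0.08, minlen = minlen),
          appearance = list(default = "lhs", rhs = "whole milk"),
          control    = list(verbose = FALSE))
}

rules_len1 <- mine_wholemilk(1)  # allows a rule with an empty lhs
rules_len2 <- mine_wholemilk(2)  # every rule needs at least one lhs item

table(size(rules_len1))  # rule lengths (lhs + rhs) with minlen = 1
table(size(rules_len2))  # rule lengths (lhs + rhs) with minlen = 2
```

With minlen = 2, the rule of length 1 (an empty LHS implying whole milk) no longer appears in the result.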

Sometimes you might be interested in keeping only the rules that involve the maximum number of items, and in removing the shorter rules that are subsets of longer rules, since these are considered redundant.

subsets <- which(colSums(is.subset(wholemilk_rules, grocery_rules)) > 1) # indices of grocery_rules that have more than one whole-milk rule as a subset

length(subsets)

## [1] 11

grocery_rules_prunned <- grocery_rules[-subsets]

grocery_rules_prunned

## set of 4 rules

plot(grocery_rules_prunned, method="paracoord", control=list(reorder=TRUE))

We can see that only 4 of the 15 rules remain. The pruning can be adjusted based on the nature of the data. The lower the support threshold (0.0001, for instance), the more rules will be generated.
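The subset test behind this kind of pruning can be sketched in base R. The example below treats each rule as a plain character vector of all its items (LHS plus RHS). The longer rule is {curd, yogurt} => {whole milk} from the results above, the shorter one is made up for illustration, and is_item_subset() is a hypothetical helper rather than an arules function.

```r
# Two rules written as plain character vectors of all their items (lhs + rhs).
short_rule <- c("yogurt", "whole milk")          # illustrative shorter rule
long_rule  <- c("curd", "yogurt", "whole milk")  # items of {curd, yogurt} => {whole milk}

# A rule's item set is a subset of another's when every item is found in it.
is_item_subset <- function(a, b) all(a %in% b)

is_item_subset(short_rule, long_rule)  # TRUE
is_item_subset(long_rule, short_rule)  # FALSE
```

is.subset() in arules applies this kind of comparison to every pair of rules at once and returns the results as a matrix, which is why the pruning code sums over its columns.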

How to Find Rules Related to Given Items?

This can be achieved by modifying the appearance parameter in the apriori() function, for example to find what factors influenced the purchase of product X.

We will find out what customers had purchased before buying ‘Whole Milk’. However, we will not plot the rules because there are too many of them.

rules_before <- apriori (data=Groceries, parameter=list (supp=0.001,conf = 0.08), 
                                appearance = list (default="lhs",rhs="whole milk"), 
                                control = list (verbose=F)) # get rules that lead to buying 'whole milk'

rules_before_conf <- sort (rules_before, by="confidence", decreasing=TRUE) # 'high-confidence' rules.

inspect(head(rules_before_conf))

##     lhs                     rhs              support confidence    coverage     lift count
## [1] {rice,                                                                                
##      sugar}              => {whole milk} 0.001220132          1 0.001220132 3.913649    12
## [2] {canned fish,                                                                         
##      hygiene articles}   => {whole milk} 0.001118454          1 0.001118454 3.913649    11
## [3] {root vegetables,                                                                     
##      butter,                                                                              
##      rice}               => {whole milk} 0.001016777          1 0.001016777 3.913649    10
## [4] {root vegetables,                                                                     
##      whipped/sour cream,                                                                  
##      flour}              => {whole milk} 0.001728521          1 0.001728521 3.913649    17
## [5] {butter,                                                                              
##      soft cheese,                                                                         
##      domestic eggs}      => {whole milk} 0.001016777          1 0.001016777 3.913649    10
## [6] {pip fruit,                                                                           
##      butter,                                                                              
##      hygiene articles}   => {whole milk} 0.001016777          1 0.001016777 3.913649    10

To find out what products were purchased after/along with product X (consequential transaction), we swap the roles of the right-hand side (RHS) and the left-hand side (LHS). Basically, this is for the “Customers who bought ‘Whole Milk’ also bought…” scenario.

rules_after <- apriori (data=Groceries, parameter=list (supp=0.001,conf = 0.15,minlen=2), 
                        appearance = list(default="rhs",lhs="whole milk"), 
                        control = list (verbose=F)) # those who bought 'milk' also bought..

rules_after_conf <- sort (rules_after, by="confidence", decreasing=TRUE) # 'high-confidence' rules.

inspect(head(rules_after_conf))

##     lhs             rhs                support    confidence coverage lift     
## [1] {whole milk} => {other vegetables} 0.07483477 0.2928770  0.255516 1.5136341
## [2] {whole milk} => {rolls/buns}       0.05663447 0.2216474  0.255516 1.2050318
## [3] {whole milk} => {yogurt}           0.05602440 0.2192598  0.255516 1.5717351
## [4] {whole milk} => {root vegetables}  0.04890696 0.1914047  0.255516 1.7560310
## [5] {whole milk} => {tropical fruit}   0.04229792 0.1655392  0.255516 1.5775950
## [6] {whole milk} => {soda}             0.04006101 0.1567847  0.255516 0.8991124
##     count
## [1] 736  
## [2] 557  
## [3] 551  
## [4] 481  
## [5] 416  
## [6] 394
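Rule [6] above has a lift below 1, which is worth a closer look. The sketch below recomputes its confidence and lift from counts printed in this document (2513 transactions with whole milk and 1715 with soda from the eclat output, 394 with both from the table above). The variable names are illustrative.

```r
n_transactions  <- 9835
wholemilk_count <- 2513   # transactions containing whole milk
soda_count      <- 1715   # transactions containing soda
both_count      <- 394    # transactions containing both (rule [6])

confidence <- both_count / wholemilk_count
lift       <- confidence / (soda_count / n_transactions)

round(c(confidence = confidence, lift = lift), 4)
```

A lift below 1 means that soda appears slightly less often in baskets with whole milk than in the data set overall, so a rule can pass the confidence threshold and still describe a weak or negative association.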

plot(rules_after, method="paracoord", control=list(reorder=TRUE))

Tarid Wongvorachan

November 6th, 2020
