Association Rules

Author

Giuseppe A. Veltri

Association rules

Association analysis identifies relations or correlations between observations and/or between variables in our datasets. These relationships are then expressed as a collection of “association rules”.

It is a core technique of data mining and is very useful for mining very large transactional databases, such as shopping baskets and online customer purchases.

Knowledge Representation: Association Rules

The general format is A -> C, where A is the antecedent and C is the consequent. The antecedent A can be generalized to a specific variable or combination of values, so the same format applies to many kinds of datasets.

Search heuristic

The basis of an association analysis algorithm is the generation of frequent itemsets.

The search uses the “apriori algorithm”, a generate-and-test type of search algorithm. Only after exploring all of the candidate itemsets containing k items does it consider those containing k + 1 items. For each k, all candidates are tested to determine whether they have enough support.

A frequent itemset is a set of items that occur together frequently enough to be considered as a candidate for generating association rules.
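The generate-and-test search described above can be sketched in base R. This is a minimal illustration, not the implementation used by the arules package: the five baskets, the `min_support` threshold of 0.5, and the helper `support()` are all invented for the example.

```r
# Five hypothetical baskets, invented purely for illustration
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk", "eggs"),
  c("bread", "butter", "milk", "eggs")
)
min_support <- 0.5  # assumed threshold for this sketch

# Hypothetical helper: proportion of baskets containing every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Level 1: every distinct item is a candidate itemset of size 1
candidates <- as.list(sort(unique(unlist(baskets))))
frequent <- list()

while (length(candidates) > 0) {
  # Test step: keep only the candidates with enough support
  keep <- vapply(candidates, function(s) support(s, baskets) >= min_support, logical(1))
  level <- candidates[keep]
  if (length(level) == 0) break
  frequent <- c(frequent, level)

  # Generate step: build candidates of size k + 1 from items that survived at size k
  k <- length(level[[1]])
  items <- sort(unique(unlist(level)))
  if (length(items) < k + 1) break
  candidates <- combn(items, k + 1, simplify = FALSE)

  # Prune step: a candidate can only be frequent if all of its size-k subsets are frequent
  is_frequent <- function(s) any(vapply(level, function(f) setequal(f, s), logical(1)))
  candidates <- Filter(function(cand) {
    all(vapply(combn(cand, k, simplify = FALSE), is_frequent, logical(1)))
  }, candidates)
}

# Print the frequent itemsets found, smallest first
for (s in frequent) cat("{", paste(s, collapse = ", "), "}\n")
```

Itemsets of size k + 1 are only generated once every itemset of size k has been tested, which is the level-wise order described above.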

Measures: Support, Confidence, Lift and Leverage

Support

“Support” is a measure of how frequently the items must appear in the whole dataset before they can be considered as a candidate association rule.
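“Support” for a collection of items is the proportion of all transactions in which the items appear together.

support(A -> C) = P(A ∪ C)

Here A ∪ C denotes the combined itemset, so P(A ∪ C) is the proportion of transactions that contain every item in A and every item in C.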

We typically use small minimum values of “support”, because we are not looking only for the obvious associations.
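As a small illustration, support can be computed by hand in base R. The baskets below are the ten DVD transactions listed in the R example later in this document; the helper name `support()` is an assumption made for this sketch and is not part of arules.

```r
# The ten DVD baskets from the dvdtrans.csv example, one character vector per transaction
baskets <- list(
  c("Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2"),
  c("Gladiator", "Patriot", "Braveheart"),
  c("LOTR1", "LOTR2"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Harry Potter1", "Harry Potter2"),
  c("Gladiator", "Patriot"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Sixth Sense", "LOTR", "Gladiator", "Green Mile")
)

# Hypothetical helper: proportion of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

support_patriot_gladiator <- support(c("Patriot", "Gladiator"), baskets)
support_lotr <- support(c("LOTR1", "LOTR2"), baskets)

print(support_patriot_gladiator)  # 0.6: six of the ten baskets contain both films
print(support_lotr)               # 0.2: two of the ten baskets contain both films
```

Both values agree with the support column that apriori() reports for these rules in the R example.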

Confidence

The actual association rules that we retain are those that meet a minimum “confidence” criterion.

“Confidence” calculates the proportion of transactions containing A that also contain C.

confidence(A -> C) = P(C | A) = P(A ∪ C) / P(A)

confidence(A -> C) = support(A -> C) / support(A)

We are typically looking for larger values of confidence.
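The following base R sketch works through the formula with counts taken from the ten DVD baskets used in the R example later in this document. The helper name `confidence()` and the variable names are assumptions made for illustration.

```r
# Counts read off the ten DVD baskets in this document
n_total       <- 10
n_sixth_sense <- 6   # baskets containing Sixth Sense
n_gladiator   <- 7   # baskets containing Gladiator
n_both        <- 5   # baskets containing both films

# Hypothetical helper: confidence(A -> C) = support(A -> C) / support(A)
confidence <- function(supp_ac, supp_a) supp_ac / supp_a

supp_both <- n_both / n_total

# Confidence depends on the direction of the rule
conf_ss_to_gl <- confidence(supp_both, n_sixth_sense / n_total)  # {Sixth Sense} -> {Gladiator}
conf_gl_to_ss <- confidence(supp_both, n_gladiator / n_total)    # {Gladiator} -> {Sixth Sense}

print(conf_ss_to_gl)
print(conf_gl_to_ss)
```

The two rules share the same support but have different confidence, because each divides by the support of its own antecedent.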

Lift

“Lift” is the increased likelihood of C being in a transaction if A is included in the transaction:

lift(A -> C) = confidence(A -> C) / support(C)
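A short base R sketch of the lift calculation follows, using support values for two rules from the DVD example in this document. The helper name `lift()` is an assumption for illustration. A lift above 1 indicates that A and C appear together more often than would be expected if they were independent.

```r
# Hypothetical helper: lift of A -> C computed from three support values
lift <- function(supp_ac, supp_a, supp_c) {
  conf <- supp_ac / supp_a   # confidence(A -> C)
  conf / supp_c              # divide by the baseline frequency of C
}

# Support values taken from the DVD example in this document
lift_lotr    <- lift(supp_ac = 0.2, supp_a = 0.2, supp_c = 0.2)  # {LOTR1} -> {LOTR2}
lift_patriot <- lift(supp_ac = 0.6, supp_a = 0.6, supp_c = 0.7)  # {Patriot} -> {Gladiator}

# Swapping antecedent and consequent leaves lift unchanged
lift_gladiator <- lift(supp_ac = 0.6, supp_a = 0.7, supp_c = 0.6)  # {Gladiator} -> {Patriot}

print(c(lift_lotr, lift_patriot, lift_gladiator))
```

Unlike confidence, lift is symmetric: reversing the rule gives the same value.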

Leverage

Leverage captures the fact that a rule with a higher frequency of A and C but a lower lift may still be interesting:

leverage(A -> C) = support(A -> C) - support(A) * support(C)
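The contrast between lift and leverage can be seen on two rules from the DVD example in this document. The sketch below is base R only; the data frame `dvd_rules` and its column names are assumptions made for illustration.

```r
# Support values for two rules from the DVD example in this document
dvd_rules <- data.frame(
  rule    = c("{LOTR1} -> {LOTR2}", "{Patriot} -> {Gladiator}"),
  supp_ac = c(0.2, 0.6),   # support(A -> C)
  supp_a  = c(0.2, 0.6),   # support(A)
  supp_c  = c(0.2, 0.7)    # support(C)
)

# Lift is a ratio of observed to expected co-occurrence; leverage is the difference
dvd_rules$lift     <- dvd_rules$supp_ac / (dvd_rules$supp_a * dvd_rules$supp_c)
dvd_rules$leverage <- dvd_rules$supp_ac - dvd_rules$supp_a * dvd_rules$supp_c

print(dvd_rules[, c("rule", "lift", "leverage")])
```

The LOTR rule has by far the higher lift, but the Patriot rule has the higher leverage because it applies to many more transactions.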

Association rules in R

Two types of association rules were identified corresponding to the type of data available.

The simplest case, known as “market basket analysis”, is when we have a transaction dataset that records just a transaction identifier and an item. The identifier might identify a single shopping basket containing multiple items, or a particular customer or patient and their associated purchases or medical treatments over time.

A simple example of a market basket dataset might record the purchases of DVDs by customers (three customers in this case):

ID Item
1 Sixth Sense
1 LOTR1
1 Harry Potter1
1 Green Mile
1 LOTR2
2 Gladiator
2 Patriot
2 Braveheart
3 LOTR1
3 LOTR2

When loading a dataset to process with apriori(), it must be converted into a transaction data structure. Consider a basket dataset with two columns, one being the identifier of the “basket” and the other being an item contained in the basket, as is the case for the dvdtrans.csv data.

library("arules")
Loading required package: Matrix

Attaching package: 'arules'
The following objects are masked from 'package:base':

    abbreviate, write
# read.csv() is part of base R (utils), so readr is not needed here
dvdtrans <- read.csv("dvdtrans.csv")
str(dvdtrans)
'data.frame':   30 obs. of  2 variables:
 $ ID  : int  1 1 1 1 1 2 2 2 3 3 ...
 $ Item: chr  "Sixth Sense" "LOTR1" "Harry Potter1" "Green Mile" ...
dvdtrans
   ID          Item
1   1   Sixth Sense
2   1         LOTR1
3   1 Harry Potter1
4   1    Green Mile
5   1         LOTR2
6   2     Gladiator
7   2       Patriot
8   2    Braveheart
9   3         LOTR1
10  3         LOTR2
11  4     Gladiator
12  4       Patriot
13  4   Sixth Sense
14  5     Gladiator
15  5       Patriot
16  5   Sixth Sense
17  6     Gladiator
18  6       Patriot
19  6   Sixth Sense
20  7 Harry Potter1
21  7 Harry Potter2
22  8     Gladiator
23  8       Patriot
24  9     Gladiator
25  9       Patriot
26  9   Sixth Sense
27 10   Sixth Sense
28 10          LOTR
29 10     Gladiator
30 10    Green Mile
dvdDS <- new.env()
dvdDS$data <- as(split(dvdtrans$Item, dvdtrans$ID),
                 "transactions")
dvdDS$data
transactions in sparse format with
 10 transactions (rows) and
 10 items (columns)

We can then build the model using the transformed dataset:

dvdAPRIORI <- new.env(parent=dvdDS)
evalq({
  model <- apriori(data, 
                   parameter=list(support=0.2,
                                  confidence=0.1))
}, dvdAPRIORI)
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.1    0.1    1 none FALSE            TRUE       5     0.2      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 2 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[10 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [7 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [20 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
dvdAPRIORI$model
set of 20 rules 

The rules can be ordered by confidence with sort() and displayed with inspect():

inspect(sort(dvdAPRIORI$model, 
             # limit display to first 5 rules
             by="confidence")[1:5])
    lhs                       rhs           support confidence coverage
[1] {LOTR1}                => {LOTR2}       0.2     1          0.2     
[2] {LOTR2}                => {LOTR1}       0.2     1          0.2     
[3] {Green Mile}           => {Sixth Sense} 0.2     1          0.2     
[4] {Patriot}              => {Gladiator}   0.6     1          0.6     
[5] {Patriot, Sixth Sense} => {Gladiator}   0.4     1          0.4     
    lift     count
[1] 5.000000 2    
[2] 5.000000 2    
[3] 1.666667 2    
[4] 1.428571 6    
[5] 1.428571 4    
library(arulesViz)
plot(dvdAPRIORI$model, method = "graph", measure = "lift", shading = "confidence")

Example 2

The second example uses the Groceries dataset that ships with the arules package.

data(Groceries)
summary(Groceries)
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146 

most frequent items:
      whole milk other vegetables       rolls/buns             soda 
            2513             1903             1809             1715 
          yogurt          (Other) 
            1372            34055 

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
  17   18   19   20   21   22   23   24   26   27   28   29   32 
  29   14   14    9   11    4    6    1    1    1    1    3    1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   3.000   4.409   6.000  32.000 

includes extended item information - examples:
       labels  level2           level1
1 frankfurter sausage meat and sausage
2     sausage sausage meat and sausage
3  liver loaf sausage meat and sausage
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5    0.01      1
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 98 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [88 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [15 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
inspect(rules)
     lhs                                       rhs                support   
[1]  {curd, yogurt}                         => {whole milk}       0.01006609
[2]  {other vegetables, butter}             => {whole milk}       0.01148958
[3]  {other vegetables, domestic eggs}      => {whole milk}       0.01230300
[4]  {yogurt, whipped/sour cream}           => {whole milk}       0.01087951
[5]  {other vegetables, whipped/sour cream} => {whole milk}       0.01464159
[6]  {pip fruit, other vegetables}          => {whole milk}       0.01352313
[7]  {citrus fruit, root vegetables}        => {other vegetables} 0.01037112
[8]  {tropical fruit, root vegetables}      => {other vegetables} 0.01230300
[9]  {tropical fruit, root vegetables}      => {whole milk}       0.01199797
[10] {tropical fruit, yogurt}               => {whole milk}       0.01514997
[11] {root vegetables, yogurt}              => {other vegetables} 0.01291307
[12] {root vegetables, yogurt}              => {whole milk}       0.01453991
[13] {root vegetables, rolls/buns}          => {other vegetables} 0.01220132
[14] {root vegetables, rolls/buns}          => {whole milk}       0.01270971
[15] {other vegetables, yogurt}             => {whole milk}       0.02226741
     confidence coverage   lift     count
[1]  0.5823529  0.01728521 2.279125  99  
[2]  0.5736041  0.02003050 2.244885 113  
[3]  0.5525114  0.02226741 2.162336 121  
[4]  0.5245098  0.02074225 2.052747 107  
[5]  0.5070423  0.02887646 1.984385 144  
[6]  0.5175097  0.02613116 2.025351 133  
[7]  0.5862069  0.01769192 3.029608 102  
[8]  0.5845411  0.02104728 3.020999 121  
[9]  0.5700483  0.02104728 2.230969 118  
[10] 0.5173611  0.02928317 2.024770 149  
[11] 0.5000000  0.02582613 2.584078 127  
[12] 0.5629921  0.02582613 2.203354 143  
[13] 0.5020921  0.02430097 2.594890 120  
[14] 0.5230126  0.02430097 2.046888 125  
[15] 0.5128806  0.04341637 2.007235 219  
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
inspect(rules_sorted[1:10])
     lhs                                  rhs                support   
[1]  {citrus fruit, root vegetables}   => {other vegetables} 0.01037112
[2]  {tropical fruit, root vegetables} => {other vegetables} 0.01230300
[3]  {root vegetables, rolls/buns}     => {other vegetables} 0.01220132
[4]  {root vegetables, yogurt}         => {other vegetables} 0.01291307
[5]  {curd, yogurt}                    => {whole milk}       0.01006609
[6]  {other vegetables, butter}        => {whole milk}       0.01148958
[7]  {tropical fruit, root vegetables} => {whole milk}       0.01199797
[8]  {root vegetables, yogurt}         => {whole milk}       0.01453991
[9]  {other vegetables, domestic eggs} => {whole milk}       0.01230300
[10] {yogurt, whipped/sour cream}      => {whole milk}       0.01087951
     confidence coverage   lift     count
[1]  0.5862069  0.01769192 3.029608 102  
[2]  0.5845411  0.02104728 3.020999 121  
[3]  0.5020921  0.02430097 2.594890 120  
[4]  0.5000000  0.02582613 2.584078 127  
[5]  0.5823529  0.01728521 2.279125  99  
[6]  0.5736041  0.02003050 2.244885 113  
[7]  0.5700483  0.02104728 2.230969 118  
[8]  0.5629921  0.02582613 2.203354 143  
[9]  0.5525114  0.02226741 2.162336 121  
[10] 0.5245098  0.02074225 2.052747 107  
library(arulesViz)

# The deprecated 'interactive' argument is dropped; the default engine is non-interactive
plot(rules_sorted, method = "graph", measure = "lift", shading = "confidence")
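The apriori() output above reports support, confidence, coverage, lift and count, but not leverage. One way to add it is the interestMeasure() function from arules, sketched below under the assumption that the arules package and its Groceries dataset are installed. The object name `grocery_rules` is chosen for this example only.

```r
library(arules)
data(Groceries)

# Same thresholds as Example 2; verbose output is switched off for brevity
grocery_rules <- apriori(Groceries,
                         parameter = list(supp = 0.01, conf = 0.5),
                         control = list(verbose = FALSE))

# Compute leverage for each rule and store it alongside the other quality measures
quality(grocery_rules)$leverage <- interestMeasure(grocery_rules,
                                                   measure = "leverage",
                                                   transactions = Groceries)

# Show the three rules with the highest leverage
inspect(head(sort(grocery_rules, by = "leverage"), n = 3))
```

Sorting by leverage instead of lift tends to favour rules that cover more transactions, which is the trade-off described in the Leverage section.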