Data Mining Association Rules

Association rule mining is a method for discovering previously unknown relationships hidden in large datasets. The rules are derived from frequent itemsets and describe which items tend to occur together in the dataset.

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.[1] Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami [2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions,potatoes} => {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat.

Applications:

Understanding customer purchasing behaviour through association rule mining enables several applications. As the example above shows, rules help to identify new opportunities for cross-selling products to customers. They are also used for personalised marketing promotions, smarter inventory management, product placement strategies in stores, and better customer relationship management.

Association rules

Mining association rules was first introduced by Agrawal, Imieliński, and Swami (1993) and can formally be defined as follows. Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I.

A rule is defined as an implication of the form X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (itemsets for short) X and Y are called the antecedent (left-hand side or LHS) and the consequent (right-hand side or RHS) of the rule.

Association rules are rules which surpass a user-specified minimum support and minimum confidence threshold.

Support

The support supp(X) of an itemset X is defined as the proportion of transactions in the data set that contain the itemset.
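As a minimal sketch of this definition, the base R snippet below computes support on a small, made-up transaction list. The data and the helper name supp() are assumptions for illustration only and are not part of the arules package.

```r
# Toy transaction database (hypothetical data, not taken from Groceries)
transactions <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("bread")
)

# supp(X): share of transactions that contain every item in X
supp <- function(itemset, db) {
  mean(vapply(db, function(tx) all(itemset %in% tx), logical(1)))
}

supp("milk", transactions)              # milk occurs in 3 of 5 transactions
supp(c("milk", "bread"), transactions)  # both occur together in 2 of 5
```

Support can only stay the same or shrink as items are added to an itemset, which is why {milk, bread} cannot be more frequent than {milk} alone.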

Confidence

The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y)/supp(X).
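The formula can be tried out on a toy example. The sketch below uses made-up transactions and two hypothetical helpers, supp() and conf(), written in base R; it is an illustration of the definition, not the way arules computes confidence internally.

```r
# Toy transaction database (hypothetical data, not taken from Groceries)
transactions <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("bread")
)

# Share of transactions that contain every item in the itemset
supp <- function(itemset, db) {
  mean(vapply(db, function(tx) all(itemset %in% tx), logical(1)))
}

# conf(X => Y) = supp(X u Y) / supp(X)
conf <- function(lhs, rhs, db) {
  supp(union(lhs, rhs), db) / supp(lhs, db)
}

conf("milk", "bread", transactions)  # 0.4 / 0.6
conf("bread", "milk", transactions)  # 0.4 / 0.8
```

The two calls give different values, which illustrates that confidence depends on the direction of the rule.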

Lift

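Another popular measure for association rules is lift (Brin, Motwani, Ullman, and Tsur 1997). The lift of a rule is defined as lift(X ⇒ Y) = supp(X ∪ Y)/(supp(X) supp(Y)). A lift of 1 means that X and Y occur together as often as expected if they were independent; values above 1 indicate that they occur together more often than that.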
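To make the lift formula concrete, here is a small base R sketch on made-up transactions. The data and the helper names supp() and lift() are assumptions chosen for illustration.

```r
# Toy transaction database (hypothetical data, not taken from Groceries)
transactions <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("bread")
)

# Share of transactions that contain every item in the itemset
supp <- function(itemset, db) {
  mean(vapply(db, function(tx) all(itemset %in% tx), logical(1)))
}

# lift(X => Y) = supp(X u Y) / (supp(X) * supp(Y))
lift <- function(lhs, rhs, db) {
  supp(union(lhs, rhs), db) / (supp(lhs, db) * supp(rhs, db))
}

lift("milk", "bread", transactions)   # 0.4 / (0.6 * 0.8), below 1
lift("milk", "butter", transactions)  # 0.4 / (0.6 * 0.6), above 1
lift("bread", "milk", transactions)   # same value as milk => bread
```

Unlike confidence, lift is symmetric: swapping the antecedent and the consequent does not change the value.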

library(arules)
## Warning: package 'arules' was built under R version 3.3.3
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library("arulesViz")
## Warning: package 'arulesViz' was built under R version 3.3.3
## Loading required package: grid

Step 1: Read the data

data("Groceries")
summary(Groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
inspect(Groceries[1:5])
##     items                     
## [1] {citrus fruit,            
##      semi-finished bread,     
##      margarine,               
##      ready soups}             
## [2] {tropical fruit,          
##      yogurt,                  
##      coffee}                  
## [3] {whole milk}              
## [4] {pip fruit,               
##      yogurt,                  
##      cream cheese ,           
##      meat spreads}            
## [5] {other vegetables,        
##      whole milk,              
##      condensed milk,          
##      long life bakery product}
itemFrequencyPlot(Groceries, topN = 25)
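The figures printed by summary(Groceries) can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from the output above; the variable names are made up for this illustration. Density is the share of non-empty cells in the 9835 x 169 item matrix, so it should equal the total number of purchased items divided by the number of cells.

```r
n_transactions <- 9835
n_items <- 169
density <- 0.02609146

# Counts from the "most frequent items" table, including the "(Other)" bucket
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)
total_items <- sum(item_counts)

total_items / (n_transactions * n_items)  # close to the reported density
total_items / n_transactions              # close to the reported mean length of 4.409
density * n_items                         # the same mean, derived from the density
```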

Step 2: Find the association rules

rules <- apriori(Groceries, parameter=list(support=0.001, confidence=0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(head(sort(rules, by ="lift"),5))
##     lhs                        rhs                  support confidence     lift
## [1] {Instant food products,                                                    
##      soda}                  => {hamburger meat} 0.001220132  0.6315789 18.99565
## [2] {soda,                                                                     
##      popcorn}               => {salty snack}    0.001220132  0.6315789 16.69779
## [3] {flour,                                                                    
##      baking powder}         => {sugar}          0.001016777  0.5555556 16.40807
## [4] {ham,                                                                      
##      processed cheese}      => {white bread}    0.001931876  0.6333333 15.04549
## [5] {whole milk,                                                               
##      Instant food products} => {hamburger meat} 0.001525165  0.5000000 15.03823
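Relative support values are easier to judge when converted back to transaction counts. The sketch below is plain arithmetic on numbers copied from the output in this document. That the printed "Absolute minimum support count" corresponds to the rounded-down product of the support threshold and the number of transactions is an observation from these outputs, not a statement about the internals of arules.

```r
n_transactions <- 9835

# Relative support thresholds used in this document
min_support <- c(0.001, 0.02, 0.03, 0.04)
min_count <- floor(min_support * n_transactions)
min_count

# Supports of the five rules with the highest lift, copied from the output above
rule_support <- c(0.001220132, 0.001220132, 0.001016777, 0.001931876, 0.001525165)
rule_count <- round(rule_support * n_transactions)
rule_count
```

Seen this way, the rules with the highest lift each rest on fewer than 20 transactions out of 9835, so they are worth reading with some caution.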

Step 3: Visualization of the Itemsets

rules <- apriori(Groceries, parameter=list(support=0.001, confidence=0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
itemsets <- eclat(Groceries, parameter = list(support = 0.02, minlen=2))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.02      2     10 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 196 
## 
## create itemset ... 
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating sparse bit matrix ... [59 row(s), 9835 column(s)] done [0.00s].
## writing  ... [63 set(s)] done [0.01s].
## Creating S4 object  ... done [0.00s].
plot(itemsets, method="graph")

itemsets <- eclat(Groceries, parameter = list(support = 0.03, minlen=2))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.03      2     10 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 295 
## 
## create itemset ... 
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [44 item(s)] done [0.00s].
## creating sparse bit matrix ... [44 row(s), 9835 column(s)] done [0.00s].
## writing  ... [19 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
plot(itemsets, method="graph")

itemsets <- eclat(Groceries, parameter = list(support = 0.04, minlen=2))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.04      2     10 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 393 
## 
## create itemset ... 
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [32 item(s)] done [0.00s].
## creating sparse bit matrix ... [32 row(s), 9835 column(s)] done [0.00s].
## writing  ... [9 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
plot(itemsets, method="graph")

plot(itemsets, method="paracoord", control=list(alpha=0.8, reorder=TRUE))

quality(itemsets) <- interestMeasure(itemsets, trans=Groceries)
head(quality(itemsets))
##      support allConfidence crossSupportRatio      lift
## 1 0.04229792     0.1655392         0.4106645 1.5775950
## 2 0.04890696     0.1914047         0.4265818 1.7560310
## 3 0.04738180     0.2448765         0.5633211 2.2466049
## 4 0.04006101     0.1567847         0.6824513 0.8991124
## 5 0.05602440     0.2192598         0.5459610 1.5717351
## 6 0.04341637     0.2243826         0.7209669 1.6084566
plot(itemsets, measure=c("support", "allConfidence"), shading="lift",control = list(col=rainbow(7)))
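The columns of quality(itemsets) are related to each other, and the relationship can be checked by hand. The sketch below assumes the usual definitions, which are not stated in this document: all-confidence is the itemset support divided by the largest single-item support, and the cross-support ratio is the smallest single-item support divided by the largest. It also assumes that the six itemsets shown are pairs, in which case lift is the itemset support divided by the product of the two item supports. The values are copied from the output above.

```r
# First six rows of quality(itemsets), copied from the output above
q <- data.frame(
  support           = c(0.04229792, 0.04890696, 0.04738180, 0.04006101, 0.05602440, 0.04341637),
  allConfidence     = c(0.1655392, 0.1914047, 0.2448765, 0.1567847, 0.2192598, 0.2243826),
  crossSupportRatio = c(0.4106645, 0.4265818, 0.5633211, 0.6824513, 0.5459610, 0.7209669),
  lift              = c(1.5775950, 1.7560310, 2.2466049, 0.8991124, 1.5717351, 1.6084566)
)

# Recover the two single-item supports of each pair
max_item_supp <- q$support / q$allConfidence          # more frequent item
min_item_supp <- q$crossSupportRatio * max_item_supp  # less frequent item

# Rebuild lift from its definition and compare with the reported column
lift_rebuilt <- q$support / (max_item_supp * min_item_supp)
cbind(reported = q$lift, rebuilt = lift_rebuilt)
```

Under these assumptions the rebuilt values agree with the reported lift up to rounding, and the larger item support in the first row comes out close to 2513/9835, the share of transactions containing whole milk.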

Step 4: Visualization of the association rules

oneRule <- sample(rules, 1)
inspect(oneRule)
##     lhs                  rhs                    support confidence     lift
## [1] {tropical fruit,                                                       
##      root vegetables,                                                      
##      oil}             => {other vegetables} 0.001728521       0.85 4.392932
plot(oneRule, method="doubledecker", data = Groceries)
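The quality measures of this rule can be reproduced from numbers already shown in this document. Because lift(X ⇒ Y) = supp(X ∪ Y)/(supp(X) supp(Y)), it can also be written as conf(X ⇒ Y)/supp(Y). The sketch below takes the count for "other vegetables" from the summary(Groceries) output and the support and confidence from the rule printed above; the variable names are made up for illustration.

```r
n_transactions <- 9835
count_other_vegetables <- 1903  # from the summary(Groceries) output

rule_support <- 0.001728521
rule_confidence <- 0.85

# Number of transactions that contain the whole rule, and the antecedent alone
rule_count <- round(rule_support * n_transactions)
lhs_count <- round(rule_count / rule_confidence)

# lift = conf(X => Y) / supp(Y)
supp_rhs <- count_other_vegetables / n_transactions
rule_lift <- rule_confidence / supp_rhs
rule_lift  # close to the reported lift of 4.392932
```

Since sample() draws a rule at random, a different run will usually show a different rule, but the same check applies to any rule whose consequent frequency is known.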

plot(rules, control = list(col=rainbow(7)))

subrules <- subset(rules, lift>5)
plot(subrules, method="matrix3D", measure="lift", control=list(reorder=TRUE))
## Itemsets in Antecedent (LHS)
##  [1] "{tropical fruit,other vegetables,whole milk,yogurt,oil}"         
##  [2] "{citrus fruit,other vegetables,soda,fruit/vegetable juice}"      
##  [3] "{tropical fruit,other vegetables,whole milk,oil}"                
##  [4] "{other vegetables,whole milk,yogurt,rice}"                       
##  [5] "{beef,citrus fruit,tropical fruit,other vegetables}"             
##  [6] "{whole milk,rolls/buns,soda,newspapers}"                         
##  [7] "{ham,tropical fruit,pip fruit,yogurt}"                           
##  [8] "{citrus fruit,tropical fruit,root vegetables,whipped/sour cream}"
##  [9] "{citrus fruit,root vegetables,soft cheese}"                      
## [10] "{tropical fruit,butter,whipped/sour cream,fruit/vegetable juice}"
## [11] "{ham,tropical fruit,pip fruit,whole milk}"                       
## [12] "{tropical fruit,grapes,whole milk,yogurt}"                       
## [13] "{pip fruit,whipped/sour cream,brown bread}"                      
## [14] "{other vegetables,butter milk,pastry}"                           
## [15] "{whipped/sour cream,pastry,fruit/vegetable juice}"               
## [16] "{tropical fruit,root vegetables,whole milk,margarine}"           
## [17] "{beef,tropical fruit,butter}"                                    
## [18] "{whipped/sour cream,cream cheese ,margarine}"                    
## [19] "{tropical fruit,other vegetables,butter,curd}"                   
## [20] "{pork,tropical fruit,fruit/vegetable juice}"                     
## [21] "{tropical fruit,butter,white bread}"                             
## [22] "{whole milk,curd,whipped/sour cream,cream cheese }"              
## [23] "{tropical fruit,butter,margarine}"                               
## [24] "{tropical fruit,whole milk,butter,curd}"                         
## [25] "{sausage,pip fruit,sliced cheese}"                               
## [26] "{tropical fruit,whole milk,butter,sliced cheese}"                
## [27] "{tropical fruit,other vegetables,butter,white bread}"            
## [28] "{other vegetables,curd,whipped/sour cream,cream cheese }"        
## [29] "{root vegetables,butter,cream cheese }"                          
## [30] "{ham,pip fruit,other vegetables,yogurt}"                         
## [31] "{citrus fruit,grapes,fruit/vegetable juice}"                     
## [32] "{liquor,red/blush wine}"                                         
## Itemsets in Consequent (RHS)
## [1] "{yogurt}"           "{other vegetables}" "{bottled beer}"    
## [4] "{tropical fruit}"   "{root vegetables}"

subrules2 <- head(sort(rules, by="lift"), 10)
plot(subrules2, method="graph")

# A parallel coordinates plot for 10 rules 
# The width of the arrows represents support 
# The intensity of the color represents confidence

plot(subrules2, method="paracoord", control=list(col=3,alpha=1, reorder=TRUE))