Association rule mining is a method for discovering previously unknown relationships hidden in large datasets. The rules are derived from frequent itemsets, that is, sets of items that often occur together, and they express the relationships uncovered in the dataset.
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules in databases using measures of interestingness.[1] Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} => {burger} found in the sales data of a supermarket would indicate that a customer who buys onions and potatoes together is also likely to buy hamburger meat.
Understanding customer purchasing behaviour through association rule mining enables several applications. As the example above shows, rules help to identify new opportunities for cross-selling products to customers. They are also used for personalised marketing promotions, smarter inventory management, product placement strategies in stores, and better customer relationship management.
Mining association rules was first introduced by Agrawal, Imielinski, and Swami (1993) and can formally be defined as follows. Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I.
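To make the notation concrete, the database D can be pictured as a binary incidence matrix with one row per transaction and one column per item. The following is a minimal sketch in base R; the five transactions and the item names are made up for illustration and are not taken from any real dataset.

```r
# Hypothetical toy database D: 5 transactions over 4 items (illustrative only).
transactions <- list(
  t1 = c("bread", "milk"),
  t2 = c("bread", "butter", "milk"),
  t3 = c("beer"),
  t4 = c("bread", "butter"),
  t5 = c("milk", "butter", "bread")
)

# The item set I is the union of all items that occur in D.
items <- sort(unique(unlist(transactions)))

# Binary incidence matrix: entry [t, i] is TRUE if transaction t contains item i.
incidence <- t(sapply(transactions, function(tr) items %in% tr))
colnames(incidence) <- items

print(incidence)
```

Each row is one transaction ID and each TRUE cell marks an item contained in that transaction, which is the "binary attribute" view used in the definition.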
A rule is defined as an implication of the form X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (itemsets for short) X and Y are called the antecedent (left-hand side or LHS) and the consequent (right-hand side or RHS) of the rule.
Association rules are rules that satisfy both a user-specified minimum support and a user-specified minimum confidence threshold.
The support supp(X) of an itemset X is defined as the proportion of transactions in the data set that contain the itemset.
The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y)/supp(X).
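Another popular measure for association rules, used throughout this document, is lift (Brin, Motwani, Ullman, and Tsur 1997). The lift of a rule is defined as lift(X ⇒ Y) = supp(X ∪ Y)/(supp(X) supp(Y)). It compares the observed support of the rule with the support expected if X and Y were independent, so values above 1 indicate that X and Y occur together more often than expected.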
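The three measures can be checked by hand on a small example. The sketch below uses base R only; the toy transactions, the helper names `supp`, `conf` and `lift`, and the two thresholds are assumptions made for illustration and are not part of the arules package.

```r
# Hypothetical toy database (illustrative only, not the Groceries data).
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("beer"),
  c("bread", "butter"),
  c("milk", "butter", "bread")
)

# supp(X): proportion of transactions that contain every item in X.
supp <- function(itemset) {
  mean(sapply(transactions, function(tr) all(itemset %in% tr)))
}

# conf(X => Y) = supp(X u Y) / supp(X)
conf <- function(lhs, rhs) supp(union(lhs, rhs)) / supp(lhs)

# lift(X => Y) = supp(X u Y) / (supp(X) * supp(Y))
lift <- function(lhs, rhs) supp(union(lhs, rhs)) / (supp(lhs) * supp(rhs))

lhs <- c("bread", "butter")
rhs <- "milk"

rule_support    <- supp(union(lhs, rhs))  # 2 of 5 transactions
rule_confidence <- conf(lhs, rhs)         # 2 of the 3 transactions containing lhs
rule_lift       <- lift(lhs, rhs)         # (2/5) / ((3/5) * (3/5))

# Hypothetical thresholds, chosen only for this example.
min_support    <- 0.3
min_confidence <- 0.6
is_association_rule <- rule_support >= min_support &&
  rule_confidence >= min_confidence
```

Here the rule {bread, butter} ⇒ {milk} passes both thresholds, and its lift is slightly above 1, meaning milk appears with bread and butter a little more often than independence would suggest.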
# arules provides the transaction data structures and the mining algorithms
library(arules)
## Warning: package 'arules' was built under R version 3.3.3
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
# arulesViz adds plotting methods for rules and itemsets
library("arulesViz")
## Warning: package 'arulesViz' was built under R version 3.3.3
## Loading required package: grid
# Load the Groceries transactions that ship with arules and summarise them
data("Groceries")
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
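The density and the mean transaction length in this summary can be reproduced from the other printed figures. The sketch below is plain arithmetic in base R on numbers copied from the output above; the variable names are illustrative, and it assumes that the "(Other)" count covers all remaining items.

```r
# Figures copied from the summary(Groceries) output.
n_transactions <- 9835
n_items <- 169
item_counts <- c(whole_milk = 2513, other_vegetables = 1903, rolls_buns = 1809,
                 soda = 1715, yogurt = 1372, other = 34055)

# Total number of item occurrences across all transactions.
total_occurrences <- sum(item_counts)

# Density: share of filled cells in the transactions-by-items matrix.
density <- total_occurrences / (n_transactions * n_items)

# Mean transaction length: item occurrences per transaction.
mean_length <- total_occurrences / n_transactions

# Support of the most frequent single item.
supp_whole_milk <- item_counts[["whole_milk"]] / n_transactions

print(c(density = density, mean_length = mean_length,
        supp_whole_milk = supp_whole_milk))
```

A density of about 2.6% means the incidence matrix is very sparse, which is why the transactions are stored in sparse format.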
# Show the first five transactions
inspect(Groceries[1:5])
##     items
## [1] {citrus fruit,
##      semi-finished bread,
##      margarine,
##      ready soups}
## [2] {tropical fruit,
##      yogurt,
##      coffee}
## [3] {whole milk}
## [4] {pip fruit,
##      yogurt,
##      cream cheese ,
##      meat spreads}
## [5] {other vegetables,
##      whole milk,
##      condensed milk,
##      long life bakery product}
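# Bar plot of the 25 most frequent items
itemFrequencyPlot(Groceries, topN = 25)
# Mine rules with minimum support 0.001 and minimum confidence 0.5
rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.5))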
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
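The "Absolute minimum support count" line converts the relative support threshold into a number of transactions. The sketch below is an observation about the output printed in this document, not a statement about how arules is implemented: truncating support times the number of transactions reproduces the count of 9 above as well as the counts 196, 295 and 393 reported by the eclat runs further down.

```r
# Number of transactions, as reported in the output above.
n_transactions <- 9835

# Relative support thresholds used in this document.
relative_support <- c(0.001, 0.02, 0.03, 0.04)

# Truncating (rather than rounding) matches the printed counts.
absolute_count <- floor(relative_support * n_transactions)

print(data.frame(relative_support, absolute_count))
```

In other words, a support of 0.001 keeps any itemset that appears in roughly ten or more of the 9835 transactions, which explains why so many rules are found.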
# Show the five rules with the highest lift
inspect(head(sort(rules, by = "lift"), 5))
##     lhs                        rhs                  support confidence     lift
## [1] {Instant food products,
##      soda}                  => {hamburger meat} 0.001220132  0.6315789 18.99565
## [2] {soda,
##      popcorn}               => {salty snack}    0.001220132  0.6315789 16.69779
## [3] {flour,
##      baking powder}         => {sugar}          0.001016777  0.5555556 16.40807
## [4] {ham,
##      processed cheese}      => {white bread}    0.001931876  0.6333333 15.04549
## [5] {whole milk,
##      Instant food products} => {hamburger meat} 0.001525165  0.5000000 15.03823
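The printed measures of the first rule can be turned back into transaction counts using the definitions of support, confidence and lift. The sketch below is base R arithmetic on the numbers shown above; the variable names are illustrative, and the counts are rounded because the printed values are themselves rounded.

```r
# Values printed for rule [1]: {Instant food products, soda} => {hamburger meat}
n_transactions <- 9835
support    <- 0.001220132
confidence <- 0.6315789
lift       <- 18.99565

# support = count(X u Y) / n, so the rule holds in this many transactions.
n_rule <- round(support * n_transactions)

# confidence = count(X u Y) / count(X), so the LHS occurs in this many.
n_lhs <- round(n_rule / confidence)

# lift = confidence / supp(Y), so the implied support of the RHS is:
supp_rhs <- confidence / lift
n_rhs <- round(supp_rhs * n_transactions)

print(c(n_rule = n_rule, n_lhs = n_lhs, n_rhs = n_rhs))
```

The very high lift therefore rests on a small number of transactions, which is worth keeping in mind when ranking rules by lift at such a low support threshold.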
# Raise the minimum confidence to 0.8 to keep only the stronger rules
rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.8))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Mine frequent itemsets with at least two items and support >= 0.02
itemsets <- eclat(Groceries, parameter = list(support = 0.02, minlen = 2))
## Eclat
##
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.02      2     10 frequent itemsets FALSE
##
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
##
## Absolute minimum support count: 196
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating sparse bit matrix ... [59 row(s), 9835 column(s)] done [0.00s].
## writing ... [63 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
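# Draw the 63 frequent itemsets as a graph
plot(itemsets, method = "graph")
# Repeat with a higher support threshold (0.03) for a less crowded graph
itemsets <- eclat(Groceries, parameter = list(support = 0.03, minlen = 2))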
## Eclat
##
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.03      2     10 frequent itemsets FALSE
##
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
##
## Absolute minimum support count: 295
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [44 item(s)] done [0.00s].
## creating sparse bit matrix ... [44 row(s), 9835 column(s)] done [0.00s].
## writing ... [19 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
# Draw the 19 remaining itemsets as a graph
plot(itemsets, method = "graph")
# Raise the support threshold once more (0.04)
itemsets <- eclat(Groceries, parameter = list(support = 0.04, minlen = 2))
## Eclat
##
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.04      2     10 frequent itemsets FALSE
##
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
##
## Absolute minimum support count: 393
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [32 item(s)] done [0.00s].
## creating sparse bit matrix ... [32 row(s), 9835 column(s)] done [0.00s].
## writing ... [9 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
# Graph and parallel coordinates plots of the 9 remaining itemsets
plot(itemsets, method = "graph")
plot(itemsets, method = "paracoord", control = list(alpha = 0.8, reorder = TRUE))
# Replace the quality slot with additional interest measures for each itemset
quality(itemsets) <- interestMeasure(itemsets, transactions = Groceries)
head(quality(itemsets))
##      support allConfidence crossSupportRatio      lift
## 1 0.04229792     0.1655392         0.4106645 1.5775950
## 2 0.04890696     0.1914047         0.4265818 1.7560310
## 3 0.04738180     0.2448765         0.5633211 2.2466049
## 4 0.04006101     0.1567847         0.6824513 0.8991124
## 5 0.05602440     0.2192598         0.5459610 1.5717351
## 6 0.04341637     0.2243826         0.7209669 1.6084566
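The columns of this table are related to each other, and the first row can be used to check that. The sketch below assumes the usual definitions (allConfidence is the itemset support divided by the largest single-item support, crossSupportRatio is the smallest single-item support divided by the largest, and lift of an itemset is its support divided by the product of its item supports) and it assumes that itemset 1 contains exactly two items. These are assumptions for illustration rather than a description of the arules internals.

```r
# First row of the quality table printed above.
support             <- 0.04229792
all_confidence      <- 0.1655392
cross_support_ratio <- 0.4106645
printed_lift        <- 1.5775950

# allConfidence = supp(X) / max item support  =>  solve for max item support
max_item_supp <- support / all_confidence

# crossSupportRatio = min item support / max item support  =>  solve for min
min_item_supp <- cross_support_ratio * max_item_supp

# For a two-item set {a, b}: lift = supp(X) / (supp(a) * supp(b))
recomputed_lift <- support / (max_item_supp * min_item_supp)

print(c(max_item_supp = max_item_supp, min_item_supp = min_item_supp,
        recomputed_lift = recomputed_lift))
```

The recovered maximum item support agrees with whole milk's share of transactions in the summary output (2513 of 9835), and the recomputed lift agrees with the printed one, which supports the two-item assumption for this row.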
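# Scatter plot of itemsets: support vs. allConfidence, shaded by lift
plot(itemsets, measure = c("support", "allConfidence"), shading = "lift", control = list(col = rainbow(7)))
# Pick one rule at random (the result differs between runs) and display it
oneRule <- sample(rules, 1)
inspect(oneRule)
##     lhs                   rhs                    support confidence     lift
## [1] {tropical fruit,
##      root vegetables,
##      oil}              => {other vegetables} 0.001728521       0.85 4.392932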
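# Doubledecker plot for the single sampled rule
plot(oneRule, method = "doubledecker", data = Groceries)
# Default scatter plot of all rules mined with confidence >= 0.8
plot(rules, control = list(col = rainbow(7)))
# Keep only rules with lift above 5 and show them as a 3D matrix plot
# (newer arulesViz releases may expect method = "matrix" with engine = "3d")
subrules <- subset(rules, lift > 5)
plot(subrules, method = "matrix3D", measure = "lift", control = list(reorder = TRUE))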
## Itemsets in Antecedent (LHS)
## [1] "{tropical fruit,other vegetables,whole milk,yogurt,oil}"
## [2] "{citrus fruit,other vegetables,soda,fruit/vegetable juice}"
## [3] "{tropical fruit,other vegetables,whole milk,oil}"
## [4] "{other vegetables,whole milk,yogurt,rice}"
## [5] "{beef,citrus fruit,tropical fruit,other vegetables}"
## [6] "{whole milk,rolls/buns,soda,newspapers}"
## [7] "{ham,tropical fruit,pip fruit,yogurt}"
## [8] "{citrus fruit,tropical fruit,root vegetables,whipped/sour cream}"
## [9] "{citrus fruit,root vegetables,soft cheese}"
## [10] "{tropical fruit,butter,whipped/sour cream,fruit/vegetable juice}"
## [11] "{ham,tropical fruit,pip fruit,whole milk}"
## [12] "{tropical fruit,grapes,whole milk,yogurt}"
## [13] "{pip fruit,whipped/sour cream,brown bread}"
## [14] "{other vegetables,butter milk,pastry}"
## [15] "{whipped/sour cream,pastry,fruit/vegetable juice}"
## [16] "{tropical fruit,root vegetables,whole milk,margarine}"
## [17] "{beef,tropical fruit,butter}"
## [18] "{whipped/sour cream,cream cheese ,margarine}"
## [19] "{tropical fruit,other vegetables,butter,curd}"
## [20] "{pork,tropical fruit,fruit/vegetable juice}"
## [21] "{tropical fruit,butter,white bread}"
## [22] "{whole milk,curd,whipped/sour cream,cream cheese }"
## [23] "{tropical fruit,butter,margarine}"
## [24] "{tropical fruit,whole milk,butter,curd}"
## [25] "{sausage,pip fruit,sliced cheese}"
## [26] "{tropical fruit,whole milk,butter,sliced cheese}"
## [27] "{tropical fruit,other vegetables,butter,white bread}"
## [28] "{other vegetables,curd,whipped/sour cream,cream cheese }"
## [29] "{root vegetables,butter,cream cheese }"
## [30] "{ham,pip fruit,other vegetables,yogurt}"
## [31] "{citrus fruit,grapes,fruit/vegetable juice}"
## [32] "{liquor,red/blush wine}"
## Itemsets in Consequent (RHS)
## [1] "{yogurt}" "{other vegetables}" "{bottled beer}"
## [4] "{tropical fruit}" "{root vegetables}"
# Graph plot of the 10 rules with the highest lift
subrules2 <- head(sort(rules, by = "lift"), 10)
plot(subrules2, method = "graph")
# A parallel coordinates plot for the same 10 rules
# The width of the arrows represents support
# The intensity of the color represents confidence
plot(subrules2, method = "paracoord", control = list(col = 3, alpha = 1, reorder = TRUE))