The goal of market basket analysis is to find what things go together, provide information to guide product placement in stores, cross-category and co-marketing promotions.
# Association Rules for Market Basket Analysis (R)
library(arules) # association rules
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz) # data visualization of association rules
## Loading required package: grid
library(RColorBrewer) # color palettes for plots
We use grocery data set, which represents one month transaction data from a grocery outlet. The data set consists of N = 9835 market maskets across K = 169 genreically labelled grocery items.
data(Groceries) # grocery transactions object from arules package
# show the dimensions of the transactions object
print(dim(Groceries))
## [1] 9835 169
print(dim(Groceries)[1]) # 9835 market baskets for shopping trips
## [1] 9835
print(dim(Groceries)[2])
## [1] 169
A key challenge in market basket analysis and association rule modeling is the sheer number of rules that are generated. An association rule is a division of each item set into two subsets with one subset, the antecedent, thought of as preceeding the other subset, the consequent.
itemFrequencyPlot(Groceries, support = 0.025, cex.names=0.8, xlim = c(0,0.3),
type = "relative", horiz = TRUE, col = "dark red", las = 1,
xlab = paste("Proportion of Market Baskets Containing Item",
"\n(Item Relative Frequency or Support)"))
# explore possibilities for combining similar items
print(head(itemInfo(Groceries)))
## labels level2 level1
## 1 frankfurter sausage meat and sausage
## 2 sausage sausage meat and sausage
## 3 liver loaf sausage meat and sausage
## 4 ham sausage meat and sausage
## 5 meat sausage meat and sausage
## 6 finished products sausage meat and sausage
print(levels(itemInfo(Groceries)[["level1"]])) # 10 levels... too few
## [1] "canned food" "detergent" "drinks"
## [4] "fresh products" "fruit and vegetables" "meat and sausage"
## [7] "non-food" "perfumery" "processed food"
## [10] "snacks and candies"
print(levels(itemInfo(Groceries)[["level2"]])) # 55 distinct levels
## [1] "baby food" "bags"
## [3] "bakery improver" "bathroom cleaner"
## [5] "beef" "beer"
## [7] "bread and backed goods" "candy"
## [9] "canned fish" "canned fruit/vegetables"
## [11] "cheese" "chewing gum"
## [13] "chocolate" "cleaner"
## [15] "coffee" "condiments"
## [17] "cosmetics" "dairy produce"
## [19] "delicatessen" "dental care"
## [21] "detergent/softener" "eggs"
## [23] "fish" "frozen foods"
## [25] "fruit" "games/books/hobby"
## [27] "garden" "hair care"
## [29] "hard drinks" "health food"
## [31] "jam/sweet spreads" "long-life bakery products"
## [33] "meat spreads" "non-alc. drinks"
## [35] "non-food house keeping products" "non-food kitchen"
## [37] "packaged fruit/vegetables" "perfumery"
## [39] "personal hygiene" "pet food/care"
## [41] "pork" "poultry"
## [43] "pudding powder" "sausage"
## [45] "seasonal products" "shelf-stable dairy"
## [47] "snacks" "soap"
## [49] "soups/sauces" "staple foods"
## [51] "sweetener" "tea/cocoa drinks"
## [53] "vegetables" "vinegar/oils"
## [55] "wine"
# aggregate items using the 55 level2 levels for food categories
# to create a more meaningful set of items
groceries <- aggregate(Groceries, itemInfo(Groceries)[["level2"]])
print(dim(groceries)[1]) # 9835 market baskets for shopping trips
## [1] 9835
print(dim(groceries)[2]) # 55 final store items (categories)
## [1] 55
itemFrequencyPlot(groceries, support = 0.025, cex.names=1.0, xlim = c(0,0.5),
type = "relative", horiz = TRUE, col = "blue", las = 1,
xlab = paste("Proportion of Market Baskets Containing Item",
"\n(Item Relative Frequency or Support)"))
This algorithm deals with large number of association rules problem by using selection criteria that reflect the potential utility of association rules. The first criteria is referred to as ‘support’ of an item. It determines the proportion of times an item occurs in store data set. The second criteria is ‘confidence’ or predictability of an association rule. This is computed as the support of an item set divided by the support of the subset of items in antecedent.
first.rules <- apriori(groceries,
parameter = list(support = 0.001, confidence = 0.05))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[55 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [54 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.01s].
## writing ... [69921 rule(s)] done [0.01s].
## creating S4 object ... done [0.02s].
print(summary(first.rules)) # yields 69,921 rules... too many
## set of 69921 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5 6 7 8
## 21 1205 10467 23895 22560 9888 1813 72
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 4.000 4.502 5.000 8.000
##
## summary of quality measures:
## support confidence lift
## Min. :0.001017 Min. :0.0500 Min. : 0.4475
## 1st Qu.:0.001118 1st Qu.:0.2110 1st Qu.: 1.8315
## Median :0.001525 Median :0.4231 Median : 2.2573
## Mean :0.002488 Mean :0.4364 Mean : 2.5382
## 3rd Qu.:0.002339 3rd Qu.:0.6269 3rd Qu.: 2.9662
## Max. :0.443010 Max. :1.0000 Max. :16.1760
##
## mining info:
## data ntransactions support confidence
## groceries 9835 0.001 0.05
# select association rules using thresholds for support and confidence
second.rules <- apriori(groceries,
parameter = list(support = 0.025, confidence = 0.05))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.025 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 245
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[55 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [32 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [344 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
print(summary(second.rules))
## set of 344 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4
## 21 162 129 32
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 2.0 2.0 2.5 3.0 4.0
##
## summary of quality measures:
## support confidence lift
## Min. :0.02542 Min. :0.05043 Min. :0.6669
## 1st Qu.:0.03030 1st Qu.:0.18202 1st Qu.:1.2498
## Median :0.03854 Median :0.39522 Median :1.4770
## Mean :0.05276 Mean :0.37658 Mean :1.4831
## 3rd Qu.:0.05236 3rd Qu.:0.51271 3rd Qu.:1.7094
## Max. :0.44301 Max. :0.79841 Max. :2.4073
##
## mining info:
## data ntransactions support confidence
## groceries 9835 0.025 0.05
plot(second.rules,
control=list(jitter=2, col = rev(brewer.pal(9, "Greens")[4:9])),
shading = "lift")
plot(second.rules, method="grouped",
control=list(col = rev(brewer.pal(9, "Greens")[4:9])))
vegie.rules <- subset(second.rules, subset = rhs %pin% "vegetables")
inspect(vegie.rules) # 41 rules
## lhs rhs support confidence lift
## [1] {} => {vegetables} 0.27300458 0.2730046 1.0000000
## [2] {poultry} => {vegetables} 0.02897814 0.5745968 2.1047148
## [3] {pork} => {vegetables} 0.03009659 0.5220459 1.9122238
## [4] {staple foods} => {vegetables} 0.02613116 0.5160643 1.8903136
## [5] {eggs} => {vegetables} 0.03141840 0.4951923 1.8138608
## [6] {games/books/hobby} => {vegetables} 0.02785968 0.3145809 1.1522918
## [7] {long-life bakery products} => {vegetables} 0.02907982 0.3492063 1.2791227
## [8] {perfumery} => {vegetables} 0.03213015 0.4056483 1.4858662
## [9] {beef} => {vegetables} 0.04585663 0.5595533 2.0496116
## [10] {bags} => {vegetables} 0.03141840 0.3175745 1.1632571
## [11] {vinegar/oils} => {vegetables} 0.04199288 0.4666667 1.7093731
## [12] {chocolate} => {vegetables} 0.03192679 0.2934579 1.0749195
## [13] {beer} => {vegetables} 0.03406202 0.2189542 0.8020168
## [14] {frozen foods} => {vegetables} 0.04738180 0.4052174 1.4842879
## [15] {cheese} => {vegetables} 0.05531266 0.4365971 1.5992300
## [16] {sausage} => {vegetables} 0.07625826 0.4032258 1.4769929
## [17] {fruit} => {vegetables} 0.10706660 0.4297959 1.5743176
## [18] {non-alc. drinks} => {vegetables} 0.09456024 0.2974097 1.0893944
## [19] {bread and backed goods} => {vegetables} 0.11621759 0.3363743 1.2321198
## [20] {dairy produce} => {vegetables} 0.17041179 0.3846683 1.4090180
## [21] {beef,
## dairy produce} => {vegetables} 0.02989324 0.6074380 2.2250104
## [22] {dairy produce,
## vinegar/oils} => {vegetables} 0.03141840 0.5355286 1.9616103
## [23] {dairy produce,
## frozen foods} => {vegetables} 0.03436706 0.5121212 1.8758704
## [24] {cheese,
## fruit} => {vegetables} 0.02674123 0.5197628 1.9038613
## [25] {bread and backed goods,
## cheese} => {vegetables} 0.02887646 0.4536741 1.6617821
## [26] {cheese,
## dairy produce} => {vegetables} 0.04219624 0.4987981 1.8270686
## [27] {fruit,
## sausage} => {vegetables} 0.03426538 0.5290424 1.9378517
## [28] {non-alc. drinks,
## sausage} => {vegetables} 0.03029995 0.4156206 1.5223944
## [29] {bread and backed goods,
## sausage} => {vegetables} 0.04382308 0.4229637 1.5492916
## [30] {dairy produce,
## sausage} => {vegetables} 0.05266904 0.4905303 1.7967842
## [31] {fruit,
## non-alc. drinks} => {vegetables} 0.04361973 0.4657980 1.7061914
## [32] {bread and backed goods,
## fruit} => {vegetables} 0.05124555 0.4763705 1.7449177
## [33] {dairy produce,
## fruit} => {vegetables} 0.07869853 0.5032510 1.8433793
## [34] {bread and backed goods,
## non-alc. drinks} => {vegetables} 0.04636502 0.3731588 1.3668590
## [35] {dairy produce,
## non-alc. drinks} => {vegetables} 0.06446365 0.4243641 1.5544213
## [36] {bread and backed goods,
## dairy produce} => {vegetables} 0.08195221 0.4366197 1.5993128
## [37] {dairy produce,
## fruit,
## sausage} => {vegetables} 0.02714794 0.5741935 2.1032378
## [38] {bread and backed goods,
## dairy produce,
## sausage} => {vegetables} 0.03284189 0.5135135 1.8809704
## [39] {dairy produce,
## fruit,
## non-alc. drinks} => {vegetables} 0.03304525 0.5183413 1.8986543
## [40] {bread and backed goods,
## dairy produce,
## fruit} => {vegetables} 0.04077275 0.5276316 1.9326840
## [41] {bread and backed goods,
## dairy produce,
## non-alc. drinks} => {vegetables} 0.03345196 0.4627286 1.6949480
# sort by lift and identify the top 10 rules
top.vegie.rules <- head(sort(vegie.rules, decreasing = TRUE, by = "lift"), 10)
inspect(top.vegie.rules)
## lhs rhs support confidence lift
## [1] {beef,
## dairy produce} => {vegetables} 0.02989324 0.6074380 2.225010
## [2] {poultry} => {vegetables} 0.02897814 0.5745968 2.104715
## [3] {dairy produce,
## fruit,
## sausage} => {vegetables} 0.02714794 0.5741935 2.103238
## [4] {beef} => {vegetables} 0.04585663 0.5595533 2.049612
## [5] {dairy produce,
## vinegar/oils} => {vegetables} 0.03141840 0.5355286 1.961610
## [6] {fruit,
## sausage} => {vegetables} 0.03426538 0.5290424 1.937852
## [7] {bread and backed goods,
## dairy produce,
## fruit} => {vegetables} 0.04077275 0.5276316 1.932684
## [8] {pork} => {vegetables} 0.03009659 0.5220459 1.912224
## [9] {cheese,
## fruit} => {vegetables} 0.02674123 0.5197628 1.903861
## [10] {dairy produce,
## fruit,
## non-alc. drinks} => {vegetables} 0.03304525 0.5183413 1.898654
plot(top.vegie.rules, method="graph",
control=list(type="items"),
shading = "lift")