We will use the arules package in the lesson that follows.
install.packages("arules")
Next, we load the arules package for use.
library(arules)
In the lesson that follows, we use the Groceries data set from the arules package, which contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The grocery store would like to find actionable, explainable and non-trivial rules based on Association Analysis of the data.
data("Groceries")
We can explore the data using our typical data overview functions, including str() and summary(). The str() function output gives us information about the data, including under @itemInfo, where we see that there are 169 item labels, or 169 unique items in our transactions.
str(Groceries)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
## .. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
## .. .. ..@ Dim : int [1:2] 169 9835
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 169 obs. of 3 variables:
## .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
## .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
## .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
## ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
We can use the length() function to obtain T, the total number of transactions in the dataset.
length(Groceries)
## [1] 9835
The summary() function reports the total number of transactions, the number of items, the support count for the most frequent items and the distribution of transaction sizes (the number of items per transaction).
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels level2 level1
## 1 frankfurter sausage meat and sausage
## 2 sausage sausage meat and sausage
## 3 liver loaf sausage meat and sausage
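The density reported above can be reproduced from the dimensions in the str() output. The short sketch below is only a hand check in base R. The variable names are illustrative, and the three numbers are copied from the output shown earlier rather than read from the Groceries object.

```r
# Values copied from the str() output above (not computed from the data here)
n_nonzero      <- 43367  # length of the @i slot: total item occurrences
n_items        <- 169    # rows of the sparse matrix (items)
n_transactions <- 9835   # columns of the sparse matrix (transactions)

# Density: share of cells in the item-by-transaction matrix that are filled
density <- n_nonzero / (n_items * n_transactions)

# Mean number of items per transaction
mean_size <- n_nonzero / n_transactions

print(round(density, 8))
print(round(mean_size, 3))
```

Both results agree with the summary() output: a density of 0.02609146 and a mean transaction size of 4.409.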
By saving the summary() object, we can create a barplot of the lengths component, which gives us an idea of the frequency of transactions with n items.
test <- summary(Groceries)
We can use the test@lengths component to create the barplot. As shown, most of our transactions contain only a few items, with single-item transactions being the most frequent.
barplot(height = test@lengths,
xlab = "Number of Items per Transaction",
ylab = "Frequency",
cex.names = .5)
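To put numbers on that observation, the size distribution printed by summary(Groceries) can be re-entered by hand. This is a base R sketch with hypothetical variable names, and the counts are transcribed from the output above.

```r
# Transaction sizes and their counts, transcribed from summary(Groceries)
sizes  <- c(1:24, 26:29, 32)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350, 246, 182, 117,
            78, 77, 55, 46, 29, 14, 14, 9, 11, 4, 6, 1, 1, 1, 1, 3, 1)

total     <- sum(counts)     # number of transactions
share     <- counts / total  # share of transactions of each size
cum_share <- cumsum(share)   # share with at most that many items

# Share of single-item transactions, then of transactions with three or fewer items
print(round(share[sizes == 1], 3))
print(round(cum_share[sizes == 3], 3))
```

Roughly 22% of transactions contain a single item, and just over half contain three or fewer.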
We can use the inspect() function to view the data as itemsets. We use the head() function to view the first 6 itemsets.
inspect(head(Groceries))
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups}
## [2] {tropical fruit,
## yogurt,
## coffee}
## [3] {whole milk}
## [4] {pip fruit,
## yogurt,
## cream cheese ,
## meat spreads}
## [5] {other vegetables,
## whole milk,
## condensed milk,
## long life bakery product}
## [6] {whole milk,
## butter,
## yogurt,
## rice,
## abrasive cleaner}
The itemFrequency() function from the arules package returns the support values (default) for the items.
We can set type = "absolute" to view the support count. Below, we view the support count, or the number of transactions containing the item, for the first item.
itemFrequency(x = Groceries,
type = "absolute")[1] # 1st item
## frankfurter
## 580
We can set type = "relative" (default) to view the support value (support count / total number of transactions).
itemFrequency(x = Groceries,
type = "relative")[1] # 1st item
## frankfurter
## 0.05897306
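The relationship between the two types is a simple division. The check below uses only the numbers printed above (580 and 9835), and the variable names are illustrative.

```r
support_count  <- 580    # frankfurter, from type = "absolute"
n_transactions <- 9835   # from length(Groceries)

# Relative support = support count / total number of transactions
support <- support_count / n_transactions
print(round(support, 8))
```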
We can view the support for the first 6 items using the head() function.
head(itemFrequency(x = Groceries))
## frankfurter sausage liver loaf ham
## 0.058973055 0.093950178 0.005083884 0.026029487
## meat finished products
## 0.025826131 0.006507372
The itemFrequencyPlot() function in the arules package allows us to view the item frequency as a barplot. Below, we restrict the plot to the 10 most frequent items using the topN argument.
Support Count (type = "absolute")
itemFrequencyPlot(x = Groceries,
type = "absolute",
topN = 10)
Support (type = "relative", default)
itemFrequencyPlot(x = Groceries,
type = "relative",
topN = 10)
We use the apriori() function in the arules package to perform Association Analysis. By default, the minimum support (minsup) is 0.1 (support = 0.1) and the minimum confidence (minconf) is 0.8 (confidence = 0.8). We will use support = 0.005 and confidence = 0.5. We set minlen = 2 so that every rule contains at least two items, which excludes rules with an empty left-hand side.
rules <- apriori(data = Groceries,
parameter = list(target = "rules",
support = 0.005,
confidence = 0.5,
minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.005 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 49
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [120 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
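The minimum support translates into a minimum number of transactions. The following is a hedged arithmetic sketch in base R. The variable names are illustrative, and it does not claim to reproduce how apriori() rounds internally.

```r
min_support    <- 0.005
n_transactions <- 9835

# Threshold expressed as a (fractional) number of transactions
threshold <- min_support * n_transactions
print(threshold)

# Smallest whole transaction count whose relative support reaches min_support
min_count <- ceiling(threshold)
print(min_count)
```

The log above reports an absolute minimum support count of 49, while an itemset needs to appear in at least 50 transactions for its relative support to reach 0.005. This is consistent with the minimum count of 50 in the summary of the rules.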
Printing the object by name tells us how many rules were created.
rules
## set of 120 rules
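To make the roles of the support and confidence thresholds concrete, here is a minimal base R sketch on made-up data. The baskets, the helper names toy_support() and rule_metrics(), and the thresholds are all hypothetical. This brute-force check is not how apriori() is implemented. It only shows what the two cut-offs test.

```r
# Hypothetical toy baskets (not taken from Groceries)
baskets <- list(
  c("milk", "bread"),
  c("milk", "yogurt"),
  c("milk", "bread", "yogurt"),
  c("bread"),
  c("milk", "bread")
)

# Support: share of baskets that contain every item in `items`
toy_support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Support, confidence and lift for the rule lhs => rhs
rule_metrics <- function(lhs, rhs) {
  supp <- toy_support(c(lhs, rhs))
  conf <- supp / toy_support(lhs)
  lift <- conf / toy_support(rhs)
  c(support = supp, confidence = conf, lift = lift)
}

# Hypothetical thresholds, playing the role of support and confidence in apriori()
min_support    <- 0.5
min_confidence <- 0.7

m <- rule_metrics("bread", "milk")
keep <- m[["support"]] >= min_support && m[["confidence"]] >= min_confidence
print(m)
print(keep)
```

Here {bread} => {milk} passes both thresholds, but its lift is below 1, a reminder that high confidence alone does not imply a positive association.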
The summary() function gives us the distribution of rule lengths (the number of items across the left- and right-hand sides), as well as descriptive statistics for the support, confidence, coverage, lift and count of our rules.
summary(rules)
## set of 120 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 1 98 21
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.167 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005084 Min. :0.5000 Min. :0.008134 Min. :1.957
## 1st Qu.:0.005669 1st Qu.:0.5181 1st Qu.:0.010142 1st Qu.:2.091
## Median :0.006202 Median :0.5445 Median :0.011490 Median :2.249
## Mean :0.007344 Mean :0.5537 Mean :0.013404 Mean :2.379
## 3rd Qu.:0.007982 3rd Qu.:0.5762 3rd Qu.:0.014667 3rd Qu.:2.643
## Max. :0.022267 Max. :0.7000 Max. :0.043416 Max. :3.691
## count
## Min. : 50.00
## 1st Qu.: 55.75
## Median : 61.00
## Mean : 72.22
## 3rd Qu.: 78.50
## Max. :219.00
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.005 0.5
We can use the inspect() and head() functions to view the first 6 rules.
inspect(head(rules))
## lhs rhs support
## [1] {baking powder} => {whole milk} 0.009252669
## [2] {other vegetables,oil} => {whole milk} 0.005083884
## [3] {root vegetables,onions} => {other vegetables} 0.005693950
## [4] {onions,whole milk} => {other vegetables} 0.006609049
## [5] {other vegetables,hygiene articles} => {whole milk} 0.005185562
## [6] {other vegetables,sugar} => {whole milk} 0.006304016
## confidence coverage lift count
## [1] 0.5229885 0.017691917 2.046793 91
## [2] 0.5102041 0.009964413 1.996760 50
## [3] 0.6021505 0.009456024 3.112008 56
## [4] 0.5462185 0.012099644 2.822942 65
## [5] 0.5425532 0.009557702 2.123363 51
## [6] 0.5849057 0.010777834 2.289115 62
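The quality measures for rule [1], {baking powder} => {whole milk}, can be recomputed by hand. The sketch below uses counts from the output above. The count of 174 transactions containing baking powder is not printed anywhere. It is inferred here as coverage × number of transactions (0.017691917 × 9835), so treat it as an assumption. The variable names are illustrative.

```r
n_transactions <- 9835
count_rule     <- 91     # transactions containing baking powder AND whole milk
count_lhs      <- 174    # baking powder; inferred from coverage (assumption)
count_rhs      <- 2513   # whole milk, from summary(Groceries)

support    <- count_rule / n_transactions
coverage   <- count_lhs / n_transactions     # support of the left-hand side
confidence <- count_rule / count_lhs         # equals support / coverage
lift       <- confidence / (count_rhs / n_transactions)

print(round(c(support = support, coverage = coverage,
              confidence = confidence, lift = lift), 6))
```

The results match the first row of the inspect() output to the printed precision.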
We can view the 10 rules with the highest support values sorted in decreasing order by using the sort() function and specifying by = "support".
inspect(head(sort(rules, by = "support",
decreasing = TRUE),
n = 10))
## lhs rhs support
## [1] {other vegetables,yogurt} => {whole milk} 0.02226741
## [2] {tropical fruit,yogurt} => {whole milk} 0.01514997
## [3] {other vegetables,whipped/sour cream} => {whole milk} 0.01464159
## [4] {root vegetables,yogurt} => {whole milk} 0.01453991
## [5] {pip fruit,other vegetables} => {whole milk} 0.01352313
## [6] {root vegetables,yogurt} => {other vegetables} 0.01291307
## [7] {root vegetables,rolls/buns} => {whole milk} 0.01270971
## [8] {other vegetables,domestic eggs} => {whole milk} 0.01230300
## [9] {tropical fruit,root vegetables} => {other vegetables} 0.01230300
## [10] {root vegetables,rolls/buns} => {other vegetables} 0.01220132
## confidence coverage lift count
## [1] 0.5128806 0.04341637 2.007235 219
## [2] 0.5173611 0.02928317 2.024770 149
## [3] 0.5070423 0.02887646 1.984385 144
## [4] 0.5629921 0.02582613 2.203354 143
## [5] 0.5175097 0.02613116 2.025351 133
## [6] 0.5000000 0.02582613 2.584078 127
## [7] 0.5230126 0.02430097 2.046888 125
## [8] 0.5525114 0.02226741 2.162336 121
## [9] 0.5845411 0.02104728 3.020999 121
## [10] 0.5020921 0.02430097 2.594890 120
We can view the 10 rules with the highest confidence values sorted in decreasing order by using the sort() function and specifying by = "confidence".
inspect(head(sort(rules,
by = "confidence",
decreasing = TRUE),
n = 10))
## lhs rhs support confidence coverage lift count
## [1] {tropical fruit,
## root vegetables,
## yogurt} => {whole milk} 0.005693950 0.7000000 0.008134215 2.739554 56
## [2] {pip fruit,
## root vegetables,
## other vegetables} => {whole milk} 0.005490595 0.6750000 0.008134215 2.641713 54
## [3] {butter,
## whipped/sour cream} => {whole milk} 0.006710727 0.6600000 0.010167768 2.583008 66
## [4] {pip fruit,
## whipped/sour cream} => {whole milk} 0.005998983 0.6483516 0.009252669 2.537421 59
## [5] {butter,
## yogurt} => {whole milk} 0.009354347 0.6388889 0.014641586 2.500387 92
## [6] {root vegetables,
## butter} => {whole milk} 0.008235892 0.6377953 0.012913066 2.496107 81
## [7] {tropical fruit,
## curd} => {whole milk} 0.006507372 0.6336634 0.010269446 2.479936 64
## [8] {citrus fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005795628 0.6333333 0.009150991 3.273165 57
## [9] {pip fruit,
## other vegetables,
## yogurt} => {whole milk} 0.005083884 0.6250000 0.008134215 2.446031 50
## [10] {pip fruit,
## domestic eggs} => {whole milk} 0.005388917 0.6235294 0.008642603 2.440275 53
Finally, we can view the 10 rules with the highest lift values sorted in decreasing order by using the sort() function and specifying by = "lift".
inspect(head(sort(rules,
by = "lift",
decreasing = TRUE),
n = 10))
## lhs rhs support confidence coverage lift count
## [1] {tropical fruit,
## curd} => {yogurt} 0.005287239 0.5148515 0.010269446 3.690645 52
## [2] {citrus fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005795628 0.6333333 0.009150991 3.273165 57
## [3] {pip fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005490595 0.6136364 0.008947636 3.171368 54
## [4] {pip fruit,
## whipped/sour cream} => {other vegetables} 0.005592272 0.6043956 0.009252669 3.123610 55
## [5] {root vegetables,
## onions} => {other vegetables} 0.005693950 0.6021505 0.009456024 3.112008 56
## [6] {citrus fruit,
## root vegetables} => {other vegetables} 0.010371124 0.5862069 0.017691917 3.029608 102
## [7] {tropical fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.007015760 0.5847458 0.011997966 3.022057 69
## [8] {tropical fruit,
## root vegetables} => {other vegetables} 0.012302999 0.5845411 0.021047280 3.020999 121
## [9] {butter,
## whipped/sour cream} => {other vegetables} 0.005795628 0.5700000 0.010167768 2.945849 57
## [10] {tropical fruit,
## whipped/sour cream} => {other vegetables} 0.007829181 0.5661765 0.013828165 2.926088 77
We can also use the subset() function within the inspect() function to view rules meeting particular criteria, such as those rules with lift values greater than 3.
inspect(subset(rules, lift > 3))
## lhs rhs support confidence coverage lift count
## [1] {root vegetables,
## onions} => {other vegetables} 0.005693950 0.6021505 0.009456024 3.112008 56
## [2] {tropical fruit,
## curd} => {yogurt} 0.005287239 0.5148515 0.010269446 3.690645 52
## [3] {pip fruit,
## whipped/sour cream} => {other vegetables} 0.005592272 0.6043956 0.009252669 3.123610 55
## [4] {citrus fruit,
## root vegetables} => {other vegetables} 0.010371124 0.5862069 0.017691917 3.029608 102
## [5] {tropical fruit,
## root vegetables} => {other vegetables} 0.012302999 0.5845411 0.021047280 3.020999 121
## [6] {pip fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005490595 0.6136364 0.008947636 3.171368 54
## [7] {citrus fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005795628 0.6333333 0.009150991 3.273165 57
## [8] {tropical fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.007015760 0.5847458 0.011997966 3.022057 69
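Sorting and subsetting rules behaves much like ordering and filtering the rows of a data frame. As an illustration only, the base R sketch below re-enters the support and lift of the first six rules shown by inspect(head(rules)) into a hypothetical data frame named first_six and applies the same two operations. It works on copied values, not on the rules object.

```r
# Values copied from inspect(head(rules)); `first_six` is a hypothetical name
first_six <- data.frame(
  rule = c("{baking powder} => {whole milk}",
           "{other vegetables,oil} => {whole milk}",
           "{root vegetables,onions} => {other vegetables}",
           "{onions,whole milk} => {other vegetables}",
           "{other vegetables,hygiene articles} => {whole milk}",
           "{other vegetables,sugar} => {whole milk}"),
  support = c(0.009252669, 0.005083884, 0.005693950,
              0.006609049, 0.005185562, 0.006304016),
  lift = c(2.046793, 1.996760, 3.112008, 2.822942, 2.123363, 2.289115)
)

# Analogue of sort(rules, by = "lift", decreasing = TRUE)
by_lift <- first_six[order(first_six$lift, decreasing = TRUE), ]
print(head(by_lift, n = 2))

# Analogue of subset(rules, lift > 3)
high_lift <- subset(first_six, lift > 3)
print(high_lift)
```

Among these six rules, only {root vegetables,onions} => {other vegetables} clears the lift > 3 filter, which matches its appearance as rule [1] in the subset output above.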
We can also view subsets based on multiple criteria, such as the items that the rules contain.
For instance, the grocery store may want to come up with promotions and advertising strategies for selling whole milk. We can create a subset based on the right-hand side (Y) including whole milk and the lift value being greater than 2.5, meaning that transactions containing some itemset X include whole milk more than 2.5 times as often as would be expected if X and whole milk were independent.
wholemilk.rhs <- subset(rules,
subset = rhs %in% "whole milk" &
lift > 2.5)
We can use the inspect() function on our subset to better understand the rules meeting our criteria. As shown, the rules have similar lift values (roughly 2.50 to 2.74) and confidence values (roughly 0.64 to 0.70), and all have low support values (under 0.01).
inspect(wholemilk.rhs)
## lhs rhs support confidence coverage lift count
## [1] {butter,
## whipped/sour cream} => {whole milk} 0.006710727 0.6600000 0.010167768 2.583008 66
## [2] {butter,
## yogurt} => {whole milk} 0.009354347 0.6388889 0.014641586 2.500387 92
## [3] {pip fruit,
## whipped/sour cream} => {whole milk} 0.005998983 0.6483516 0.009252669 2.537421 59
## [4] {pip fruit,
## root vegetables,
## other vegetables} => {whole milk} 0.005490595 0.6750000 0.008134215 2.641713 54
## [5] {tropical fruit,
## root vegetables,
## yogurt} => {whole milk} 0.005693950 0.7000000 0.008134215 2.739554 56
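One way to read a lift value is as the ratio of observed to expected co-occurrences. The base R sketch below does this for rule [5]. The count of 80 transactions containing the left-hand side is inferred from coverage × number of transactions (0.008134215 × 9835) and is an assumption, as are the variable names.

```r
n_transactions <- 9835
count_lhs      <- 80     # {tropical fruit, root vegetables, yogurt}; inferred
count_rule     <- 56     # left-hand side AND whole milk, from the count column
count_rhs      <- 2513   # whole milk, from summary(Groceries)

# If the left-hand side and whole milk were independent, we would expect
# whole milk in this many of the 80 left-hand-side transactions:
expected <- count_lhs * count_rhs / n_transactions

# Lift = observed co-occurrences / expected co-occurrences
lift <- count_rule / expected
print(round(c(expected = expected, observed = count_rule, lift = lift), 3))
```

Roughly 20 co-occurrences would be expected under independence, against 56 observed, which gives the lift of about 2.74 shown in the table.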
We can take a closer look at the rules with the highest lift values by sorting the subset in decreasing order of lift and isolating the first two observations of the sorted data.
inspect(head(sort(wholemilk.rhs,
by = "lift",
decreasing = TRUE))[1:2])
## lhs rhs support confidence coverage lift count
## [1] {tropical fruit,
## root vegetables,
## yogurt} => {whole milk} 0.005693950 0.7000000 0.008134215 2.739554 56
## [2] {pip fruit,
## root vegetables,
## other vegetables} => {whole milk} 0.005490595 0.6750000 0.008134215 2.641713 54
Based on this, the grocery store may want to run promotions that pair fruits and vegetables with whole milk, or adjust the store layout to accommodate the finding that people buy fruits and vegetables (and in the top rule, yogurt) together with whole milk.
We can take a closer look at another subset, based on the item being on the left-hand side (X). Here, we create a subset based on X containing yogurt and the lift value higher than 2.25.
yogurt.lhs <- subset(rules,
subset = lhs %in% "yogurt" &
lift > 2.25)
We can use the inspect() function on our subset to better understand the rules meeting our criteria.
inspect(yogurt.lhs)
## lhs rhs support confidence coverage lift count
## [1] {curd,
## yogurt} => {whole milk} 0.010066090 0.5823529 0.017285206 2.279125 99
## [2] {butter,
## yogurt} => {whole milk} 0.009354347 0.6388889 0.014641586 2.500387 92
## [3] {root vegetables,
## yogurt} => {other vegetables} 0.012913066 0.5000000 0.025826131 2.584078 127
## [4] {other vegetables,
## yogurt,
## fruit/vegetable juice} => {whole milk} 0.005083884 0.6172840 0.008235892 2.415833 50
## [5] {whole milk,
## yogurt,
## fruit/vegetable juice} => {other vegetables} 0.005083884 0.5376344 0.009456024 2.778578 50
## [6] {whole milk,
## yogurt,
## whipped/sour cream} => {other vegetables} 0.005592272 0.5140187 0.010879512 2.656529 55
## [7] {pip fruit,
## other vegetables,
## yogurt} => {whole milk} 0.005083884 0.6250000 0.008134215 2.446031 50
## [8] {pip fruit,
## whole milk,
## yogurt} => {other vegetables} 0.005083884 0.5319149 0.009557702 2.749019 50
## [9] {tropical fruit,
## root vegetables,
## yogurt} => {whole milk} 0.005693950 0.7000000 0.008134215 2.739554 56
## [10] {tropical fruit,
## other vegetables,
## yogurt} => {whole milk} 0.007625826 0.6198347 0.012302999 2.425816 75
## [11] {tropical fruit,
## whole milk,
## yogurt} => {other vegetables} 0.007625826 0.5033557 0.015149975 2.601421 75
## [12] {root vegetables,
## other vegetables,
## yogurt} => {whole milk} 0.007829181 0.6062992 0.012913066 2.372842 77
## [13] {root vegetables,
## whole milk,
## yogurt} => {other vegetables} 0.007829181 0.5384615 0.014539908 2.782853 77
Next, we can take a closer look at the two rules with the highest lift values.
inspect(head(sort(yogurt.lhs,
by = "lift",
decreasing = TRUE))[1:2])
## lhs rhs support confidence coverage lift count
## [1] {root vegetables,
## whole milk,
## yogurt} => {other vegetables} 0.007829181 0.5384615 0.014539908 2.782853 77
## [2] {whole milk,
## yogurt,
## fruit/vegetable juice} => {other vegetables} 0.005083884 0.5376344 0.009456024 2.778578 50
Based on the top two rules, whole milk and yogurt are being bought with fruits and vegetables (and fruit/vegetable juice, in the 2nd rule), with the top rule being {root vegetables, whole milk, yogurt} -> {other vegetables}. This information can help the grocery store market to customers meeting this profile. Based on the subsets, it is clear that these milk-based products (whole milk and yogurt) and fruit and vegetable products are purchased together, and the grocery store should tailor its marketing strategies to accommodate this finding.