The output that follows was generated using the arules package in R.
In the lesson that follows, we use the Groceries data set (included in
the arules package), which contains 1 month (30
days) of real-world point-of-sale transaction data from a local grocery
outlet. The grocery store would like to find actionable, explainable and
non-trivial rules based on Association Analysis of their data.
There are 9835 total transactions, T, in the data set and 169 items.
Summary information about the data set can help us understand the data: the support count for the most frequent items, the total number of transactions, the number of items, and the distribution of the number of items per transaction.
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels level2 level1
## 1 frankfurter sausage meat and sausage
## 2 sausage sausage meat and sausage
## 3 liver loaf sausage meat and sausage
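The code that produced this summary is not shown in the lesson. A minimal sketch, assuming the arules package is installed and using only its documented `data()` and `summary()` interface, might look like this:

```r
library(arules)

# Groceries ships with arules as a "transactions" object.
data("Groceries")

# Prints the dimensions, the density, the most frequent items and the
# distribution of transaction lengths.
summary(Groceries)
```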
We can visualize the distribution of the transaction width, that is, the number of items per transaction. The bar plot below shows the frequency of transactions by transaction width. The distribution is positively skewed (the mean of 4.409 items exceeds the median of 3), with most transactions containing only a few items.
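One way to compute and plot the transaction widths is sketched below. The variable name `widths` and the axis labels are illustrative choices, not taken from the original lesson.

```r
library(arules)
data("Groceries")

# size() returns the number of items in each transaction.
widths <- size(Groceries)

# Frequency of each transaction width, plus the five-number summary and mean.
table(widths)
summary(widths)

# Bar plot of the number of transactions at each width.
barplot(table(widths),
        xlab = "Transaction width (number of items)",
        ylab = "Number of transactions")
```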
A preview of the first 6 transactions in the data set, displayed as itemsets, is below.
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups}
## [2] {tropical fruit,
## yogurt,
## coffee}
## [3] {whole milk}
## [4] {pip fruit,
## yogurt,
## cream cheese ,
## meat spreads}
## [5] {other vegetables,
## whole milk,
## condensed milk,
## long life bakery product}
## [6] {whole milk,
## butter,
## yogurt,
## rice,
## abrasive cleaner}
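A preview like the one above can be obtained by indexing the transactions object and passing the result to `inspect()`. This is a sketch of one possible approach, not necessarily the call used in the lesson.

```r
library(arules)
data("Groceries")

# Select the first 6 transactions and print them as itemsets.
first_six <- Groceries[1:6]
inspect(first_six)
```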
The support count and support for the first 6 items in the Groceries data are shown below.
| Item | Support count | Support |
|---|---|---|
| frankfurter | 580 | 0.0589731 |
| sausage | 924 | 0.0939502 |
| liver loaf | 50 | 0.0050839 |
| ham | 256 | 0.0260295 |
| meat | 254 | 0.0258261 |
| finished products | 64 | 0.0065074 |
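The two measures are related by support = support count / T. A sketch of how both could be extracted with `itemFrequency()` follows; the names `support_count` and `support_value` are illustrative.

```r
library(arules)
data("Groceries")

# Absolute frequency = support count; relative frequency = support.
support_count <- itemFrequency(Groceries, type = "absolute")
support_value <- itemFrequency(Groceries, type = "relative")

# Show both measures side by side for the first 6 items.
head(data.frame(support_count, support = support_value), 6)
```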
We can also visualize the 10 most frequent items in our transaction data and their support values.
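A plot of this kind can be drawn with `itemFrequencyPlot()`. The sketch below assumes relative frequencies (support) on the vertical axis; the lesson's exact plotting options are not shown.

```r
library(arules)
data("Groceries")

# Bar plot of the 10 items with the highest support.
itemFrequencyPlot(Groceries, topN = 10, type = "relative")

# The same ranking as a named vector, for reference.
top_items <- sort(itemFrequency(Groceries), decreasing = TRUE)[1:10]
top_items
```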
After exploring and describing our data, we can perform Association Analysis. We will use a minimum support of 0.005 and a minimum confidence of 0.5, and require each rule to contain at least 2 items (minlen = 2).
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.005 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 49
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [120 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
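The log above comes from a call to `apriori()`. The parameters below are the ones reported in the output; the object name `rules` is an assumption.

```r
library(arules)
data("Groceries")

# Mine association rules with minimum support 0.005, minimum confidence 0.5
# and at least 2 items per rule (left-hand side plus right-hand side).
rules <- apriori(
  data = Groceries,
  parameter = list(target = "rules", support = 0.005,
                   confidence = 0.5, minlen = 2)
)

length(rules)  # number of rules found
```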
This creates 120 rules in total. We can also get the distribution of rule lengths and descriptive statistics for the support, confidence, coverage, lift, and count of our rules.
## set of 120 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 1 98 21
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.167 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005084 Min. :0.5000 Min. :0.008134 Min. :1.957
## 1st Qu.:0.005669 1st Qu.:0.5181 1st Qu.:0.010142 1st Qu.:2.091
## Median :0.006202 Median :0.5445 Median :0.011490 Median :2.249
## Mean :0.007344 Mean :0.5537 Mean :0.013404 Mean :2.379
## 3rd Qu.:0.007982 3rd Qu.:0.5762 3rd Qu.:0.014667 3rd Qu.:2.643
## Max. :0.022267 Max. :0.7000 Max. :0.043416 Max. :3.691
## count
## Min. : 50.00
## 1st Qu.: 55.75
## Median : 61.00
## Mean : 72.22
## 3rd Qu.: 78.50
## Max. :219.00
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.005 0.5
## call
## apriori(data = Groceries, parameter = list(target = "rules", support = 0.005, confidence = 0.5, minlen = 2))
The 10 rules with the highest support values sorted in decreasing order are shown below.
## lhs rhs support
## [1] {other vegetables, yogurt} => {whole milk} 0.02226741
## [2] {tropical fruit, yogurt} => {whole milk} 0.01514997
## [3] {other vegetables, whipped/sour cream} => {whole milk} 0.01464159
## [4] {root vegetables, yogurt} => {whole milk} 0.01453991
## [5] {pip fruit, other vegetables} => {whole milk} 0.01352313
## [6] {root vegetables, yogurt} => {other vegetables} 0.01291307
## [7] {root vegetables, rolls/buns} => {whole milk} 0.01270971
## [8] {other vegetables, domestic eggs} => {whole milk} 0.01230300
## [9] {tropical fruit, root vegetables} => {other vegetables} 0.01230300
## [10] {root vegetables, rolls/buns} => {other vegetables} 0.01220132
## confidence coverage lift count
## [1] 0.5128806 0.04341637 2.007235 219
## [2] 0.5173611 0.02928317 2.024770 149
## [3] 0.5070423 0.02887646 1.984385 144
## [4] 0.5629921 0.02582613 2.203354 143
## [5] 0.5175097 0.02613116 2.025351 133
## [6] 0.5000000 0.02582613 2.584078 127
## [7] 0.5230126 0.02430097 2.046888 125
## [8] 0.5525114 0.02226741 2.162336 121
## [9] 0.5845411 0.02104728 3.020999 121
## [10] 0.5020921 0.02430097 2.594890 120
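To see how the quality measures relate to one another, the values for rule [1] can be recomputed by hand in base R. The numbers are copied from the output above; the variable names are illustrative.

```r
# Rule [1]: {other vegetables, yogurt} => {whole milk}
n_transactions <- 9835        # total transactions, T
count_xy       <- 219         # transactions containing X and Y together
coverage_x     <- 0.04341637  # support of the left-hand side X
count_y        <- 2513        # transactions containing whole milk

support    <- count_xy / n_transactions            # P(X and Y)
confidence <- support / coverage_x                 # P(Y | X)
lift       <- confidence / (count_y / n_transactions)  # P(Y | X) / P(Y)

round(c(support = support, confidence = confidence, lift = lift), 6)
```

The results should agree with the support, confidence and lift columns for rule [1], up to rounding of the coverage value.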
The 10 rules with the highest confidence values sorted in decreasing order are shown below.
## lhs rhs support confidence coverage lift count
## [1] {tropical fruit,
## root vegetables,
## yogurt} => {whole milk} 0.005693950 0.7000000 0.008134215 2.739554 56
## [2] {pip fruit,
## root vegetables,
## other vegetables} => {whole milk} 0.005490595 0.6750000 0.008134215 2.641713 54
## [3] {butter,
## whipped/sour cream} => {whole milk} 0.006710727 0.6600000 0.010167768 2.583008 66
## [4] {pip fruit,
## whipped/sour cream} => {whole milk} 0.005998983 0.6483516 0.009252669 2.537421 59
## [5] {butter,
## yogurt} => {whole milk} 0.009354347 0.6388889 0.014641586 2.500387 92
## [6] {root vegetables,
## butter} => {whole milk} 0.008235892 0.6377953 0.012913066 2.496107 81
## [7] {tropical fruit,
## curd} => {whole milk} 0.006507372 0.6336634 0.010269446 2.479936 64
## [8] {citrus fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005795628 0.6333333 0.009150991 3.273165 57
## [9] {pip fruit,
## other vegetables,
## yogurt} => {whole milk} 0.005083884 0.6250000 0.008134215 2.446031 50
## [10] {pip fruit,
## domestic eggs} => {whole milk} 0.005388917 0.6235294 0.008642603 2.440275 53
The 10 rules with the highest lift values, sorted in decreasing order, are shown below.
## lhs rhs support confidence coverage lift count
## [1] {tropical fruit,
## curd} => {yogurt} 0.005287239 0.5148515 0.010269446 3.690645 52
## [2] {citrus fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005795628 0.6333333 0.009150991 3.273165 57
## [3] {pip fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005490595 0.6136364 0.008947636 3.171368 54
## [4] {pip fruit,
## whipped/sour cream} => {other vegetables} 0.005592272 0.6043956 0.009252669 3.123610 55
## [5] {root vegetables,
## onions} => {other vegetables} 0.005693950 0.6021505 0.009456024 3.112008 56
## [6] {citrus fruit,
## root vegetables} => {other vegetables} 0.010371124 0.5862069 0.017691917 3.029608 102
## [7] {tropical fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.007015760 0.5847458 0.011997966 3.022057 69
## [8] {tropical fruit,
## root vegetables} => {other vegetables} 0.012302999 0.5845411 0.021047280 3.020999 121
## [9] {butter,
## whipped/sour cream} => {other vegetables} 0.005795628 0.5700000 0.010167768 2.945849 57
## [10] {tropical fruit,
## whipped/sour cream} => {other vegetables} 0.007829181 0.5661765 0.013828165 2.926088 77
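The three top-10 listings (by support, confidence and lift) can be produced by sorting the rule set on the chosen measure. This is a sketch under the assumption that the rules are stored in an object called `rules`; `control = list(verbose = FALSE)` only suppresses the mining log.

```r
library(arules)
data("Groceries")

rules <- apriori(
  Groceries,
  parameter = list(target = "rules", support = 0.005,
                   confidence = 0.5, minlen = 2),
  control = list(verbose = FALSE)
)

# sort() orders rules in decreasing order of the chosen measure by default.
top_support    <- head(sort(rules, by = "support"), n = 10)
top_confidence <- head(sort(rules, by = "confidence"), n = 10)
top_lift       <- head(sort(rules, by = "lift"), n = 10)

inspect(top_lift)
```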
We can view rules meeting particular criteria, such as those rules with lift values greater than 3.
## lhs rhs support confidence coverage lift count
## [1] {root vegetables,
## onions} => {other vegetables} 0.005693950 0.6021505 0.009456024 3.112008 56
## [2] {tropical fruit,
## curd} => {yogurt} 0.005287239 0.5148515 0.010269446 3.690645 52
## [3] {pip fruit,
## whipped/sour cream} => {other vegetables} 0.005592272 0.6043956 0.009252669 3.123610 55
## [4] {citrus fruit,
## root vegetables} => {other vegetables} 0.010371124 0.5862069 0.017691917 3.029608 102
## [5] {tropical fruit,
## root vegetables} => {other vegetables} 0.012302999 0.5845411 0.021047280 3.020999 121
## [6] {pip fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005490595 0.6136364 0.008947636 3.171368 54
## [7] {citrus fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.005795628 0.6333333 0.009150991 3.273165 57
## [8] {tropical fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.007015760 0.5847458 0.011997966 3.022057 69
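Filtering on a quality measure can be done with `subset()`. A minimal sketch, with `high_lift` as an illustrative name:

```r
library(arules)
data("Groceries")

rules <- apriori(
  Groceries,
  parameter = list(target = "rules", support = 0.005,
                   confidence = 0.5, minlen = 2),
  control = list(verbose = FALSE)
)

# Keep only the rules whose lift exceeds 3.
high_lift <- subset(rules, subset = lift > 3)
inspect(high_lift)
```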
We can also view subsets based on multiple criteria, such as the items that the rules include.
For instance, the grocery store may want to come up with promotions and advertising strategies for selling whole milk. We can create a subset in which the right-hand side (Y) is whole milk and the lift value is greater than 2.5. A lift above 2.5 means that transactions containing the itemset X are more than 2.5 times as likely to contain whole milk as transactions in general, which indicates a strong positive association between X and whole milk.
## lhs rhs support confidence coverage lift count
## [1] {butter,
## whipped/sour cream} => {whole milk} 0.006710727 0.6600000 0.010167768 2.583008 66
## [2] {butter,
## yogurt} => {whole milk} 0.009354347 0.6388889 0.014641586 2.500387 92
## [3] {pip fruit,
## whipped/sour cream} => {whole milk} 0.005998983 0.6483516 0.009252669 2.537421 59
## [4] {pip fruit,
## root vegetables,
## other vegetables} => {whole milk} 0.005490595 0.6750000 0.008134215 2.641713 54
## [5] {tropical fruit,
## root vegetables,
## yogurt} => {whole milk} 0.005693950 0.7000000 0.008134215 2.739554 56
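Conditions on items and on quality measures can be combined in a single `subset()` call. The sketch below uses the `%in%` operator from arules to match an item label on the right-hand side; the name `milk_rules` is illustrative.

```r
library(arules)
data("Groceries")

rules <- apriori(
  Groceries,
  parameter = list(target = "rules", support = 0.005,
                   confidence = 0.5, minlen = 2),
  control = list(verbose = FALSE)
)

# Rules that predict whole milk and have lift above 2.5.
milk_rules <- subset(rules, subset = rhs %in% "whole milk" & lift > 2.5)
inspect(milk_rules)
```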
As shown, the rules have similar lift values (about 2.50 to 2.74) and confidence values (about 0.64 to 0.70), but low support values (all below 0.01).
We can take a closer look at the two rules in this subset with the highest support values.
## lhs rhs support confidence
## [1] {butter, yogurt} => {whole milk} 0.009354347 0.6388889
## [2] {butter, whipped/sour cream} => {whole milk} 0.006710727 0.6600000
## coverage lift count
## [1] 0.01464159 2.500387 92
## [2] 0.01016777 2.583008 66
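Based on this subset, the grocery store may want to run promotions that pair whole milk with the items on the left-hand side of these rules: dairy products such as butter, yogurt and whipped/sour cream, and fruits and vegetables, which appear in the two highest-lift rules ({tropical fruit, root vegetables, yogurt} and {pip fruit, root vegetables, other vegetables}). It could also adjust the store layout to reflect that these items are bought together with whole milk.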
We can take a closer look at another subset, this time based on an item appearing on the left-hand side (X). Here, we create a subset in which X contains yogurt and the lift value is higher than 2.25.
## lhs rhs support confidence coverage lift count
## [1] {curd,
## yogurt} => {whole milk} 0.010066090 0.5823529 0.017285206 2.279125 99
## [2] {butter,
## yogurt} => {whole milk} 0.009354347 0.6388889 0.014641586 2.500387 92
## [3] {root vegetables,
## yogurt} => {other vegetables} 0.012913066 0.5000000 0.025826131 2.584078 127
## [4] {other vegetables,
## yogurt,
## fruit/vegetable juice} => {whole milk} 0.005083884 0.6172840 0.008235892 2.415833 50
## [5] {whole milk,
## yogurt,
## fruit/vegetable juice} => {other vegetables} 0.005083884 0.5376344 0.009456024 2.778578 50
## [6] {whole milk,
## yogurt,
## whipped/sour cream} => {other vegetables} 0.005592272 0.5140187 0.010879512 2.656529 55
## [7] {pip fruit,
## other vegetables,
## yogurt} => {whole milk} 0.005083884 0.6250000 0.008134215 2.446031 50
## [8] {pip fruit,
## whole milk,
## yogurt} => {other vegetables} 0.005083884 0.5319149 0.009557702 2.749019 50
## [9] {tropical fruit,
## root vegetables,
## yogurt} => {whole milk} 0.005693950 0.7000000 0.008134215 2.739554 56
## [10] {tropical fruit,
## other vegetables,
## yogurt} => {whole milk} 0.007625826 0.6198347 0.012302999 2.425816 75
## [11] {tropical fruit,
## whole milk,
## yogurt} => {other vegetables} 0.007625826 0.5033557 0.015149975 2.601421 75
## [12] {root vegetables,
## other vegetables,
## yogurt} => {whole milk} 0.007829181 0.6062992 0.012913066 2.372842 77
## [13] {root vegetables,
## whole milk,
## yogurt} => {other vegetables} 0.007829181 0.5384615 0.014539908 2.782853 77
Next, we can take a closer look at the two rules with the highest lift values.
## lhs rhs support confidence coverage lift count
## [1] {root vegetables,
## whole milk,
## yogurt} => {other vegetables} 0.007829181 0.5384615 0.014539908 2.782853 77
## [2] {whole milk,
## yogurt,
## fruit/vegetable juice} => {other vegetables} 0.005083884 0.5376344 0.009456024 2.778578 50
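The yogurt subset and its two highest-lift rules could be obtained as sketched here. The object names are assumptions, and the lesson's own code may differ.

```r
library(arules)
data("Groceries")

rules <- apriori(
  Groceries,
  parameter = list(target = "rules", support = 0.005,
                   confidence = 0.5, minlen = 2),
  control = list(verbose = FALSE)
)

# Rules with yogurt on the left-hand side and lift above 2.25.
yogurt_rules <- subset(rules, subset = lhs %in% "yogurt" & lift > 2.25)

# The two rules in that subset with the highest lift.
top_yogurt <- head(sort(yogurt_rules, by = "lift"), n = 2)
inspect(top_yogurt)
```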
Based on the top two rules, whole milk and yogurt are being bought with fruits and vegetables (and fruit/vegetable juice, in the 2nd rule), with the top rule being {root vegetables, whole milk, yogurt} -> {other vegetables}. This information can help the grocery store market to customers who fit this profile. Across the subsets, milk-based products (whole milk and yogurt) and fruit and vegetable products are frequently purchased together, and the grocery store should tailor its marketing strategies to this finding.