Our market basket analysis will use purchase data collected from one month of operation at a real-world grocery store. The data contains 9,835 transactions, or about 327 transactions per day (roughly 27 transactions per hour in a 12-hour business day), suggesting that the retailer is neither particularly large nor particularly small. The dataset used here was adapted from the Groceries dataset in the arules R package.
library(arules)
Loading required package: Matrix
Attaching package: ‘arules’
The following object is masked from ‘package:car’:
recode
The following objects are masked from ‘package:base’:
abbreviate, write
groceries <- read.transactions("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv", sep = ",")
summary(groceries)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
            2513             1903             1809             1715             1372            34055
element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29   14
  19   20   21   22   23   24   26   27   28   29   32
  14    9   11    4    6    1    1    1    1    3    1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000
includes extended item information - examples:
labels
1 abrasive cleaner
2 artif. sweetener
3 baby cosmetics
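The figures in this summary are tied together by simple arithmetic. As a minimal sketch in base R (the variable names are illustrative and not part of arules), multiplying the density by the matrix dimensions gives the total number of items purchased, and dividing that total by the number of transactions recovers the mean transaction size:

```r
# Values copied from the summary(groceries) output above
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Total number of non-empty cells, i.e. items purchased across all transactions
total_items <- n_transactions * n_items * density
round(total_items)

# Average number of items per transaction (the Mean of the length distribution)
mean_size <- total_items / n_transactions
round(mean_size, 3)

# Share of transactions containing whole milk, the most frequent item
milk_share <- 2513 / n_transactions
round(milk_share, 4)
```

The rounded total should agree with the sum of the counts listed under "most frequent items", including the (Other) category.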
inspect(groceries[1:5])
items
[1] {citrus fruit,
margarine,
ready soups,
semi-finished bread}
[2] {coffee,
tropical fruit,
yogurt}
[3] {whole milk}
[4] {cream cheese,
meat spreads,
pip fruit,
yogurt}
[5] {condensed milk,
long life bakery product,
other vegetables,
whole milk}
itemFrequency(groceries[, 1:3])
abrasive cleaner artif. sweetener baby cosmetics
0.0035587189 0.0032536858 0.0006100661
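The proportions returned by itemFrequency() are support values, that is, the fraction of transactions containing each item. The following base R sketch (with illustrative variable names) converts them back into transaction counts; arules can also report counts directly via itemFrequency(groceries, type = "absolute").

```r
n_transactions <- 9835

# Relative frequencies copied from the itemFrequency() output above
item_support <- c("abrasive cleaner" = 0.0035587189,
                  "artif. sweetener" = 0.0032536858,
                  "baby cosmetics"   = 0.0006100661)

# Support is the fraction of transactions containing the item,
# so multiplying by the number of transactions gives the count
item_counts <- round(item_support * n_transactions)
item_counts
```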
itemFrequencyPlot(groceries, support = 0.1)
itemFrequencyPlot(groceries, topN = 20)
image(groceries[1:5])
image(sample(groceries, 100))
apriori(groceries)
Apriori
Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE       5     0.1      1     10  rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 983
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
sorting and recoding items ... [8 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
set of 0 rules
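The empty result follows from the default thresholds. With support = 0.1, an itemset must appear in roughly 983 of the 9,835 transactions, and the log above reports that only 8 items clear that bar on their own. The sketch below, in base R, reproduces the "Absolute minimum support count" for the default threshold and for the lower threshold of 0.006 used next. The use of floor() is an assumption that happens to match the reported counts, not a statement of how arules computes them internally.

```r
n_transactions <- 9835

# Default apriori() threshold: itemsets must appear in 10 percent of transactions
default_count <- floor(0.1 * n_transactions)
default_count

# Lower threshold used in the next call to apriori()
lowered_count <- floor(0.006 * n_transactions)
lowered_count

# Support of the five most frequent items, from the summary() output
top_items <- c("whole milk" = 2513, "other vegetables" = 1903,
               "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)
round(top_items / n_transactions, 3)
```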
groceryrules <- apriori(groceries, parameter = list(support =
0.006, confidence = 0.25, minlen = 2))
Apriori
Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
       0.25    0.1    1 none FALSE            TRUE       5   0.006      2     10  rules FALSE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 59
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
sorting and recoding items ... [109 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [463 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
groceryrules
set of 463 rules
# summary of grocery association rules
summary(groceryrules)
set of 463 rules
rule length distribution (lhs + rhs):sizes
2 3 4
150 297 16
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 3.000 2.711 3.000 4.000
summary of quality measures:
support confidence lift
Min. :0.006101 Min. :0.2500 Min. :0.9932
1st Qu.:0.007117 1st Qu.:0.2971 1st Qu.:1.6229
Median :0.008744 Median :0.3554 Median :1.9332
Mean :0.011539 Mean :0.3786 Mean :2.0351
3rd Qu.:0.012303 3rd Qu.:0.4495 3rd Qu.:2.3565
Max. :0.074835 Max. :0.6600 Max. :3.9565
mining info:
data ntransactions support confidence
groceries 9835 0.006 0.25
In the final section of the summary() output, we receive mining information, telling us how the rules were chosen. Here, we see that the groceries data, which contained 9,835 transactions, was used to construct rules with a minimum support of 0.006 and minimum confidence of 0.25.
The first rule in the inspect() output below can be read in plain language as, “if a customer buys potted plants, they will also buy whole milk.” With a support of about 0.007 and a confidence of 0.400, this rule covers about 0.7 percent of the transactions and is correct in 40 percent of purchases involving potted plants. The lift value tells us how much more likely a customer is to buy whole milk relative to the average customer, given that they bought a potted plant. Since about 25.6 percent of transactions include whole milk (its support), while 40 percent of transactions with a potted plant include whole milk (the rule's confidence), we can compute the lift value as 0.40 / 0.256 = 1.56, which matches the value shown.
inspect(groceryrules[1:3])
lhs rhs support confidence lift
[1] {potted plants} => {whole milk} 0.006914082 0.4000000 1.565460
[2] {pasta} => {whole milk} 0.006100661 0.4054054 1.586614
[3] {herbs} => {root vegetables} 0.007015760 0.4312500 3.956477
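The lift calculation for the first rule can be checked directly. This is a minimal base R sketch using only numbers that appear in the output above; the variable names are illustrative.

```r
n_transactions <- 9835

# Support of the right-hand side: whole milk appears in 2,513 transactions
support_milk <- 2513 / n_transactions

# Quality measures of rule [1], {potted plants} => {whole milk}
support_rule    <- 0.006914082
confidence_rule <- 0.4

# Lift = confidence(X => Y) / support(Y)
lift_rule <- confidence_rule / support_milk
round(lift_rule, 4)

# Number of transactions containing both potted plants and whole milk
both_count <- round(support_rule * n_transactions)
both_count

# Number of transactions containing potted plants (rule count / confidence)
plant_count <- round(both_count / confidence_rule)
plant_count
```

A lift above 1 means the two items appear together more often than would be expected if they were purchased independently.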
inspect(sort(groceryrules, by = "lift")[1:5])
    lhs                   rhs                      support confidence     lift
[1] {herbs}            => {root vegetables}    0.007015760  0.4312500 3.956477
[2] {berries}          => {whipped/sour cream} 0.009049314  0.2721713 3.796886
[3] {other vegetables,
     tropical fruit,
     whole milk}       => {root vegetables}    0.007015760  0.4107143 3.768074
[4] {beef,
     other vegetables} => {root vegetables}    0.007930859  0.4020619 3.688692
[5] {other vegetables,
     tropical fruit}   => {pip fruit}          0.009456024  0.2634561 3.482649
These rules appear to be more interesting than the ones we looked at previously. The first rule, with a lift of about 3.96, implies that people who buy herbs are nearly four times more likely to buy root vegetables than the typical customer, perhaps for a stew of some sort. Rule two is also interesting: whipped/sour cream is over three times more likely to be found in a shopping cart with berries than in the typical cart, suggesting perhaps a dessert pairing.
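To make the "nearly four times" reading concrete, the baseline purchase rate of root vegetables can be recovered from the rule's own quality measures. The base R sketch below relies only on the relationship lift = confidence / support(rhs); the variable names are illustrative.

```r
# Rule [1]: {herbs} => {root vegetables}
confidence_rule <- 0.4312500
lift_rule       <- 3.956477

# Because lift = confidence / support(rhs), the baseline share of
# transactions containing root vegetables is confidence / lift
support_root_veg <- confidence_rule / lift_rule
round(support_root_veg, 3)
```

In other words, root vegetables appear in about 43 percent of the transactions containing herbs, compared with roughly 11 percent of all transactions.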
berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)
lhs rhs support confidence lift
[1] {berries} => {whipped/sour cream} 0.009049314 0.2721713 3.796886
[2] {berries} => {yogurt} 0.010574479 0.3180428 2.279848
[3] {berries} => {other vegetables} 0.010269446 0.3088685 1.596280
[4] {berries} => {whole milk} 0.011794611 0.3547401 1.388328
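The subset() call can combine item matching with conditions on the quality measures. The sketch below shows a few variations. It assumes that the Groceries data bundled with arules can stand in for groceries.csv, and the object names are illustrative. The operators %in%, %pin%, and %ain% perform any-item, partial-name, and all-items matching, respectively.

```r
library(arules)

# Assumption: the Groceries data bundled with arules stands in for groceries.csv
data("Groceries")
rules <- apriori(Groceries,
                 parameter = list(support = 0.006, confidence = 0.25, minlen = 2),
                 control = list(verbose = FALSE))

# Rules with berries on the left-hand side and a lift above 2
strong_berry_rules <- subset(rules, lhs %in% "berries" & lift > 2)

# Partial matching: rules involving any item whose name contains "fruit"
fruit_rules <- subset(rules, items %pin% "fruit")

# Complete matching: rules that contain both berries and yogurt
berry_yogurt_rules <- subset(rules, items %ain% c("berries", "yogurt"))

inspect(strong_berry_rules)
```

Restricting the match to lhs or rhs, rather than items, is useful when only rules that predict or are triggered by a particular product are of interest.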
write(groceryrules, file = "groceryrules.csv",
sep = ",", quote = TRUE, row.names = FALSE)
groceryrules_df <- as(groceryrules, "data.frame")
str(groceryrules_df)
'data.frame': 463 obs. of 4 variables:
$ rules : Factor w/ 463 levels "{baking powder} => {other vegetables}",..: 340 302 207 206 208 341 402 21 139 140 ...
$ support : num 0.00691 0.0061 0.00702 0.00773 0.00773 ...
$ confidence: num 0.4 0.405 0.431 0.475 0.475 ...
$ lift : num 1.57 1.59 3.96 2.45 1.86 ...
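Once the rules are in a data frame, ordinary base R operations apply. As a small self-contained sketch, the data frame below is built by hand from three rules shown earlier in this section so that it mirrors the structure of groceryrules_df; the name rules_df and the lift cutoff of 1.58 are illustrative choices.

```r
# Three rules copied from the inspect(groceryrules[1:3]) output,
# arranged like the data frame produced by as(groceryrules, "data.frame")
rules_df <- data.frame(
  rules      = c("{potted plants} => {whole milk}",
                 "{pasta} => {whole milk}",
                 "{herbs} => {root vegetables}"),
  support    = c(0.006914082, 0.006100661, 0.007015760),
  confidence = c(0.4000000, 0.4054054, 0.4312500),
  lift       = c(1.565460, 1.586614, 3.956477)
)

# Keep rules with lift above 1.58, strongest first
strong <- rules_df[rules_df$lift > 1.58, ]
strong <- strong[order(strong$lift, decreasing = TRUE), ]
as.character(strong$rules)
```

The same filtering and ordering steps can be applied to groceryrules_df itself.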