This report presents a complete, end-to-end implementation of association rule mining applied to the Online Retail II dataset. The source data are stored at invoice-line level, where each row represents a single product occurrence within an invoice rather than a market basket directly.
For mining, invoices are transformed into binary baskets. The key methodological choice is to build baskets from StockCode, which is a more stable product identifier than free-text descriptions. Canonical descriptions are reattached only for readability in the final outputs. Single-item invoices are also removed, because they cannot generate association rules and would otherwise dilute support estimates.
In addition to standard basket-size profiling, the report includes a light temporal profile by month and day of week. This is not sequential pattern mining, but it provides context for seasonality and helps interpret why some supports are structurally higher in specific periods.
The analysis focuses on discovering frequent itemsets and association rules that capture systematic purchasing relationships. To evaluate the strength and relevance of the extracted rules, we rely on three standard metrics: support, confidence, and lift.
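As a minimal sketch of how the three metrics relate, the following base-R example computes them for one rule on a handful of made-up baskets. The basket contents and the helper `support()` are hypothetical and are not part of the report's pipeline.

```r
# Toy baskets (hypothetical data, not taken from Online Retail II)
baskets <- list(
  c("tea", "cup"),
  c("tea", "cup", "spoon"),
  c("tea"),
  c("cup", "spoon"),
  c("tea", "cup")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "tea"
rhs <- "cup"

supp_rule <- support(c(lhs, rhs), baskets)      # share of baskets with LHS and RHS
conf_rule <- supp_rule / support(lhs, baskets)  # share of LHS baskets that also hold RHS
lift_rule <- conf_rule / support(rhs, baskets)  # confidence relative to the RHS base rate

c(support = supp_rule, confidence = conf_rule, lift = lift_rule)
```

In this toy case the confidence is 0.75, yet the lift is below 1, because the RHS item is already present in most baskets. This is the situation in which confidence alone would overstate a rule.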
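The Online Retail II dataset is publicly available from the UCI Machine Learning Repository. It contains transactional data from a UK-based online retailer selling unique all-occasion giftware, covering all invoice-line transactions recorded between 01/12/2009 and 09/12/2011. The monthly profile later in this report contains only invoices from December 2009 to December 2010, so the results describe that 13-month window rather than the full two-year span.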
Data source: https://archive.ics.uci.edu/dataset/352/online+retail
We keep only completed sales lines: drop missing invoice or product
description, keep positive quantities, and remove credit notes / returns
(invoices starting with “C”). Product descriptions are uppercased and
whitespace-normalised to stabilise reporting labels, while
StockCode is retained as the primary product identifier
used later for basket construction. InvoiceDate is
preserved for temporal profiling.
library(readxl)
library(dplyr)
library(stringr)
library(lubridate)
library(arules)
library(arulesViz)

# read_excel() reads only the first worksheet unless `sheet` is given
raw <- read_excel("online_retail_II.xlsx")

df <- raw %>%
  filter(
    !is.na(Invoice),
    !is.na(Description),
    Quantity > 0,
    !str_detect(as.character(Invoice), "^C")  # credit notes / returns
  ) %>%
  mutate(
    Invoice = as.character(Invoice),
    StockCode = as.character(StockCode),
    Description = str_squish(str_to_upper(Description)),
    `Customer ID` = as.integer(`Customer ID`),
    Country = as.character(Country)
  ) %>%
  select(Invoice, StockCode, Description, Quantity, InvoiceDate, Price,
         `Customer ID`, Country)

saveRDS(df, file.path("data", "online_retail_clean.rds"))

# "Items" here counts distinct cleaned descriptions, not StockCodes
cat(sprintf("Rows: %d | Invoices: %d | Items: %d\n",
            nrow(df), n_distinct(df$Invoice), n_distinct(df$Description)))
#> Rows: 512033 | Invoices: 21002 | Items: 4515
df %>% count(Description, sort = TRUE) %>% slice_head(n = 20)
#> # A tibble: 20 × 2
#> Description n
#> <chr> <int>
#> 1 WHITE HANGING HEART T-LIGHT HOLDER 3456
#> 2 REGENCY CAKESTAND 3 TIER 2046
#> 3 STRAWBERRY CERAMIC TRINKET BOX 1714
#> 4 PACK OF 72 RETRO SPOT CAKE CASES 1456
#> 5 ASSORTED COLOUR BIRD ORNAMENT 1450
#> 6 60 TEATIME FAIRY CAKE CASES 1394
#> 7 HOME BUILDING BLOCK WORD 1376
#> 8 JUMBO BAG RED RETROSPOT 1280
#> 9 LUNCH BAG RED SPOTTY 1246
#> 10 REX CASH+CARRY JUMBO SHOPPER 1226
#> 11 JUMBO STORAGE BAG SUKI 1203
#> 12 PACK OF 60 PINK PAISLEY CAKE CASES 1191
#> 13 WOODEN FRAME ANTIQUE WHITE 1169
#> 14 LUNCH BAG BLACK SKULL. 1156
#> 15 LUNCH BAG SUKI DESIGN 1146
#> 16 HEART OF WICKER LARGE 1145
#> 17 LOVE BUILDING BLOCK WORD 1129
#> 18 RED HANGING HEART T-LIGHT HOLDER 1106
#> 19 JUMBO SHOPPER VINTAGE RED PAISLEY 1085
#> 20 JUMBO BAG STRAWBERRY 1078
The top-20 frequency table reveals strong demand concentration. Items such as WHITE HANGING HEART T-LIGHT HOLDER or REGENCY CAKESTAND 3 TIER appear in thousands of invoice lines. Rules involving globally popular items, especially on the RHS, can achieve high confidence even without strong conditional dependence, which is why confidence should be interpreted jointly with lift.
df %>%
  count(Invoice, name = "n") %>%
  summarise(mean = mean(n), median = median(n),
            p90 = quantile(n, .9), p99 = quantile(n, .99))
#> # A tibble: 1 × 4
#> mean median p90 p99
#> <dbl> <dbl> <dbl> <dbl>
#> 1 24.4 15 51 180
The distribution has a heavy right tail: a small number of very large invoices can generate a disproportionate number of co-occurrences and inflate higher-order patterns. This motivates the maxlen constraints and conservative support thresholds used in the mining stage.
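To make the effect of large invoices concrete, the sketch below counts how many distinct item pairs and triples a single basket contains at three sizes quoted in this report: 15 and 180 (the median and 99th percentile of lines per invoice) and 674 (the largest basket). The object name `co_occurrence` is illustrative only.

```r
# Basket sizes taken from the profiling output in this report
sizes <- c(15, 180, 674)

# Distinct pairs and triples that one basket of each size contributes
co_occurrence <- data.frame(
  size    = sizes,
  pairs   = choose(sizes, 2),
  triples = choose(sizes, 3)
)

co_occurrence
```

A 180-item basket holds 16,110 pairs against 105 for a 15-item basket, so one very large invoice contributes as many pairwise co-occurrences as roughly 150 median-sized ones.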
monthly_profile <- df %>%
  mutate(month = format(InvoiceDate, "%Y-%m")) %>%
  distinct(Invoice, month) %>%
  count(month, name = "n_invoices") %>%
  arrange(month)

weekday_profile <- df %>%
  mutate(weekday = wday(InvoiceDate, label = TRUE, week_start = 1)) %>%
  distinct(Invoice, weekday) %>%
  count(weekday, name = "n_invoices")
monthly_profile
#> # A tibble: 13 × 2
#> month n_invoices
#> <chr> <int>
#> 1 2009-12 1682
#> 2 2010-01 1106
#> 3 2010-02 1203
#> 4 2010-03 1687
#> 5 2010-04 1465
#> 6 2010-05 1504
#> 7 2010-06 1652
#> 8 2010-07 1535
#> 9 2010-08 1427
#> 10 2010-09 1845
#> 11 2010-10 2304
#> 12 2010-11 2755
#> 13 2010-12 837
weekday_profile
#> # A tibble: 7 × 2
#> weekday n_invoices
#> <ord> <int>
#> 1 Mon 3331
#> 2 Tue 3830
#> 3 Wed 3744
#> 4 Thu 4306
#> 5 Fri 3042
#> 6 Sat 30
#> 7 Sun 2719
Monthly invoice counts provide a quick check for holiday concentration and broader demand regimes, while weekday counts expose the retailer’s operating rhythm. This matters because support is not purely a product property; it is partly shaped by when the store is active and how concentrated demand is across the calendar.
Association rules operate on binary baskets: an item is either
present or absent in an invoice, regardless of quantity. We therefore
de-duplicate (Invoice, StockCode) pairs before building the
transaction matrix. StockCode gives a stable item identity,
while descriptions are mapped back afterwards to keep the report
readable. We also discard baskets with fewer than 2 items, because they
cannot contribute to rule mining.
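The de-duplication step can be illustrated on a few made-up invoice lines. This base-R sketch is only an illustration of the idea; the data frame `invoice_lines` and its values are hypothetical, and the report's own pipeline uses dplyr and arules instead.

```r
# Hypothetical invoice lines: the same StockCode can repeat within an invoice
invoice_lines <- data.frame(
  Invoice   = c("A1", "A1", "A1", "A2", "A2", "A3"),
  StockCode = c("100", "100", "200", "100", "300", "200"),
  Quantity  = c(6, 12, 1, 2, 4, 24)
)

# Keep one row per (Invoice, StockCode); Quantity is deliberately ignored
pairs <- unique(invoice_lines[, c("Invoice", "StockCode")])

# Invoice x item incidence matrix: TRUE means the item is present
incidence <- unclass(table(pairs$Invoice, pairs$StockCode)) > 0

# Drop single-item baskets, which cannot produce a rule
incidence <- incidence[rowSums(incidence) >= 2, , drop = FALSE]

incidence
```

Invoice A1 lists StockCode 100 twice but contributes a single TRUE for it, and invoice A3 is dropped because it holds only one distinct item.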
# One canonical description per StockCode: the most frequent one
item_labels <- df %>%
  count(StockCode, Description, sort = TRUE) %>%
  distinct(StockCode, .keep_all = TRUE) %>%
  select(StockCode, Description)

# Only substitute description when it maps to exactly one StockCode;
# shared descriptions (e.g. product variants) keep the StockCode as label.
unique_descs <- item_labels %>%
  count(Description) %>%
  filter(n == 1) %>%
  pull(Description)

item_labels <- item_labels %>%
  mutate(label = if_else(Description %in% unique_descs, Description, StockCode))

# Binary baskets: one entry per (Invoice, StockCode), quantities ignored
baskets <- df %>%
  distinct(Invoice, StockCode) %>%
  group_by(Invoice) %>%
  summarise(items = list(StockCode), .groups = "drop")

trans <- as(baskets$items, "transactions")

# Swap StockCode item labels for the readable labels built above
label_map <- setNames(item_labels$label, item_labels$StockCode)
current_labels <- itemLabels(trans)
itemLabels(trans) <- label_map[current_labels]

# Drop baskets below the minimum size (MIN_BASKET is defined outside this excerpt)
trans <- trans[size(trans) >= MIN_BASKET]

saveRDS(trans, file.path("data", "online_retail_transactions.rds"))
summary(trans)
#> transactions as itemMatrix in sparse format with
#> 18835 rows (elements/itemsets/transactions) and
#> 4252 columns (items) and a density of 0.006201289
#>
#> most frequent items:
#> 85123A REGENCY CAKESTAND 3 TIER
#> 3246 1986
#> 85099B PACK OF 72 RETRO SPOT CAKE CASES
#> 1950 1851
#> STRAWBERRY CERAMIC TRINKET BOX (Other)
#> 1636 485970
#>
#> element (itemset/transaction) length distribution:
#> sizes
#> 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
#> 876 726 682 647 617 593 550 610 543 574 560 553 526 542 520 513 527 494 493 440
#> 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
#> 397 369 323 312 292 269 261 248 238 204 178 174 160 153 151 147 131 139 155 136
#> 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
#> 110 109 102 90 83 112 85 83 83 78 78 60 54 63 59 61 56 45 48 42
#> 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
#> 46 33 41 44 28 23 27 26 33 36 21 24 26 22 17 14 17 17 17 14
#> 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
#> 10 13 18 11 9 13 13 19 17 6 6 9 8 13 8 9 8 9 13 9
#> 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121
#> 3 7 7 8 6 8 8 7 8 7 3 9 8 10 11 8 6 4 4 8
#> 122 123 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142
#> 4 7 5 3 7 2 4 5 3 7 8 6 10 4 6 4 3 2 2 4
#> 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
#> 5 2 4 6 4 6 3 1 6 3 3 4 4 4 2 5 2 2 4 6
#> 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 180 181 182 183
#> 3 3 6 4 3 4 3 2 2 1 2 4 1 4 1 3 2 4 2 2
#> 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203
#> 4 1 2 3 1 2 3 2 3 3 2 2 1 2 1 1 4 4 3 2
#> 204 205 206 207 211 212 213 216 217 218 219 220 221 222 223 224 225 228 229 233
#> 1 2 2 2 1 1 1 2 1 5 2 3 2 1 2 2 1 2 1 2
#> 237 238 240 241 242 243 245 248 249 250 253 254 255 257 261 263 264 266 267 268
#> 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1
#> 271 272 274 275 276 279 284 285 295 296 299 307 315 316 320 323 325 332 335 337
#> 1 2 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
#> 340 341 342 343 344 355 358 363 368 369 372 376 379 384 400 407 412 416 420 425
#> 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1
#> 427 429 436 438 439 441 447 449 459 460 463 465 466 467 476 479 480 481 485 486
#> 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 497 498 499 501 507 514 516 523 536 545 546 557 568 577 578 586 589 590 595 601
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
#> 648 674
#> 1 1
#>
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2.00 9.00 17.00 26.37 30.00 674.00
#>
#> includes extended item information - examples:
#> labels
#> 1 INFLATABLE POLITICAL GLOBE
#> 2 ROBOT PENCIL SHARPNER
#> 3 GROOVY CACTUS INFLATABLE
The resulting transaction matrix remains highly sparse, which is typical for retail data. Most potential item combinations never occur, so strong rules are rare by construction. Filtering out single-item baskets removes observations that add no rule information while slightly improving the interpretability of support.
itemFrequencyPlot(trans, topN = 20, type = "absolute")
This plot shows the most frequent items by transaction count (true support counts). Very frequent items can dominate high-confidence rules (many baskets contain them), which is why lift is essential for interpreting rule quality.
Both algorithms use the same thresholds (supp = 0.01, maxlen = 3). In this report, Apriori and Eclat are used as an internal consistency check rather than as two separate analytical outputs: identical itemset counts are a quick sanity check that the discovered frequent itemsets do not depend on the implementation, although equal counts alone do not prove that the two collections are identical.
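A stricter check than comparing counts is to compare the itemsets themselves and their supports. The sketch below does this on a small hypothetical basket list rather than on the report's data. It assumes the arules package is installed, and `toy`, `fa`, and `fe` are illustrative names.

```r
library(arules)

# Hypothetical toy baskets, used only to keep the sketch self-contained
toy <- as(list(
  c("a", "b", "c"),
  c("a", "b"),
  c("a", "c"),
  c("b", "c"),
  c("a", "b", "c")
), "transactions")

params <- list(supp = 0.4, maxlen = 3)
fa <- apriori(toy, parameter = c(params, target = "frequent itemsets"),
              control = list(verbose = FALSE))
fe <- eclat(toy, parameter = params, control = list(verbose = FALSE))

# Same itemsets, regardless of the order in which each algorithm returns them
same_sets <- setequal(labels(fa), labels(fe))

# Same support for each itemset, matched by label
supp_a <- setNames(quality(fa)$support, labels(fa))
supp_e <- setNames(quality(fe)$support, labels(fe))
same_support <- isTRUE(all.equal(supp_a[sort(names(supp_a))],
                                 supp_e[sort(names(supp_e))]))

c(same_sets = same_sets, same_support = same_support)
```

Applied to the report's objects, the same comparison would read `setequal(labels(fis_apriori), labels(fis_eclat))`.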
fis_apriori <- apriori(trans, parameter = list(
  target = "frequent itemsets", supp = SUPP_FIS, maxlen = MAXLEN_FIS
))
#> Apriori
#>
#> Parameter specification:
#> confidence minval smax arem aval originalSupport maxtime support minlen
#> NA 0.1 1 none FALSE TRUE 5 0.01 1
#> maxlen target ext
#> 3 frequent itemsets TRUE
#>
#> Algorithmic control:
#> filter tree heap memopt load sort verbose
#> 0.1 TRUE TRUE FALSE TRUE 2 TRUE
#>
#> Absolute minimum support count: 188
#>
#> set item appearances ...[0 item(s)] done [0.00s].
#> set transactions ...[4238 item(s), 18835 transaction(s)] done [0.10s].
#> sorting and recoding items ... [779 item(s)] done [0.01s].
#> creating transaction tree ... done [0.00s].
#> checking subsets of size 1 2 3
#> done [0.02s].
#> sorting transactions ... done [0.00s].
#> writing ... [1699 set(s)] done [0.00s].
#> creating S4 object ... done [0.00s].
fis_eclat <- eclat(trans, parameter = list(
  supp = SUPP_FIS, maxlen = MAXLEN_FIS
))
#> Eclat
#>
#> parameter specification:
#> tidLists support minlen maxlen target ext
#> FALSE 0.01 1 3 frequent itemsets TRUE
#>
#> algorithmic control:
#> sparse sort verbose
#> 7 -2 TRUE
#>
#> Absolute minimum support count: 188
#>
#> create itemset ...
#> set transactions ...[4238 item(s), 18835 transaction(s)] done [0.10s].
#> sorting and recoding items ... [779 item(s)] done [0.01s].
#> creating sparse bit matrix ... [779 row(s), 18835 column(s)] done [0.00s].
#> writing ... [1699 set(s)] done [0.61s].
#> Creating S4 object ... done [0.00s].
cat(sprintf("Itemsets - Apriori: %d | Eclat: %d\n",
length(fis_apriori), length(fis_eclat)))
#> Itemsets - Apriori: 1699 | Eclat: 1699
stopifnot(length(fis_apriori) == length(fis_eclat))
inspect(head(sort(fis_apriori, by = "support"), 20))
#> items support count
#> [1] {85123A} 0.17233873 3246
#> [2] {REGENCY CAKESTAND 3 TIER} 0.10544200 1986
#> [3] {85099B} 0.10353066 1950
#> [4] {PACK OF 72 RETRO SPOT CAKE CASES} 0.09827449 1851
#> [5] {STRAWBERRY CERAMIC TRINKET BOX} 0.08685957 1636
#> [6] {LUNCH BAG RED SPOTTY} 0.08170958 1539
#> [7] {ASSORTED COLOUR BIRD ORNAMENT} 0.07454208 1404
#> [8] {60 TEATIME FAIRY CAKE CASES} 0.07087868 1335
#> [9] {HOME BUILDING BLOCK WORD} 0.07045394 1327
#> [10] {SET/20 RED SPOTTY PAPER NAPKINS} 0.06254314 1178
#> [11] {JUMBO STORAGE BAG SUKI} 0.06249005 1177
#> [12] {SET/5 RED SPOTTY LID GLASS BOWLS} 0.06233077 1174
#> [13] {PACK OF 60 PINK PAISLEY CAKE CASES} 0.06126891 1154
#> [14] {LUNCH BAG SUKI DESIGN} 0.06100345 1149
#> [15] {RETRO SPOT TEA SET CERAMIC 11 PC} 0.06020706 1134
#> [16] {82494L} 0.05898593 1111
#> [17] {LUNCH BAG BLACK SKULL.} 0.05898593 1111
#> [18] {HEART OF WICKER LARGE} 0.05803026 1093
#> [19] {LOVE BUILDING BLOCK WORD} 0.05771171 1087
#> [20] {RED HANGING HEART T-LIGHT HOLDER} 0.05643748 1063
For rules we use a lower support threshold (0.005) to capture less frequent cross-sell patterns. After mining we retain only rules with lift at least 1.5, which removes near-random co-occurrences and keeps only materially positive associations. Redundant rules are then removed so that the final rule set is not inflated by longer antecedents expressing the same signal.
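The redundancy idea can be sketched in base R on three made-up rules. The sketch assumes the confidence-based definition, in which a rule is redundant when a more general rule (a proper subset of its LHS, same RHS) has confidence at least as high. This is understood to be the default behaviour of `is.redundant()` in arules, but the code here is an independent illustration, and the rules and confidences are invented.

```r
# Hypothetical rules sharing the same RHS; the confidences are made up
rules_toy <- data.frame(
  lhs        = c("fork", "fork,gloves", "fork,trowel"),
  rhs        = c("spade", "spade", "spade"),
  confidence = c(0.80, 0.78, 0.91),
  stringsAsFactors = FALSE
)

# TRUE when every item of LHS `a` is in LHS `b` and `a` is strictly shorter
is_proper_subset <- function(a, b) {
  a <- strsplit(a, ",")[[1]]
  b <- strsplit(b, ",")[[1]]
  length(a) < length(b) && all(a %in% b)
}

# Flag a rule when some more general rule with the same RHS
# has confidence at least as high
redundant <- vapply(seq_len(nrow(rules_toy)), function(i) {
  any(vapply(seq_len(nrow(rules_toy)), function(j) {
    rules_toy$rhs[j] == rules_toy$rhs[i] &&
      is_proper_subset(rules_toy$lhs[j], rules_toy$lhs[i]) &&
      rules_toy$confidence[j] >= rules_toy$confidence[i]
  }, logical(1)))
}, logical(1))

cbind(rules_toy, redundant)
```

Only the second rule is flagged: adding "gloves" to the antecedent does not raise confidence above the simpler rule, whereas adding "trowel" does, so the third rule carries extra information and is kept.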
rules_raw <- apriori(trans, parameter = list(
  supp = SUPP_RULES, conf = CONF_RULES, minlen = 2L, maxlen = MAXLEN_RULES
))
#> Apriori
#>
#> Parameter specification:
#> confidence minval smax arem aval originalSupport maxtime support minlen
#> 0.2 0.1 1 none FALSE TRUE 5 0.005 2
#> maxlen target ext
#> 4 rules TRUE
#>
#> Algorithmic control:
#> filter tree heap memopt load sort verbose
#> 0.1 TRUE TRUE FALSE TRUE 2 TRUE
#>
#> Absolute minimum support count: 94
#>
#> set item appearances ...[0 item(s)] done [0.00s].
#> set transactions ...[4238 item(s), 18835 transaction(s)] done [0.10s].
#> sorting and recoding items ... [1434 item(s)] done [0.01s].
#> creating transaction tree ... done [0.01s].
#> checking subsets of size 1 2 3 4
#> done [0.06s].
#> writing ... [15377 rule(s)] done [0.00s].
#> creating S4 object ... done [0.01s].
rules <- rules_raw[quality(rules_raw)$lift >= LIFT_MIN]
rules <- rules[!is.redundant(rules)]
cat(sprintf("Rules raw: %d | After filtering: %d\n",
            length(rules_raw), length(rules)))
#> Rules raw: 15377 | After filtering: 14957
saveRDS(fis_apriori, file.path("data", "fis_apriori.rds"))
saveRDS(fis_eclat, file.path("data", "fis_eclat.rds"))
saveRDS(rules, file.path("data", "rules_final.rds"))
cat("Top 20 by lift:\n")
#> Top 20 by lift:
inspect(head(sort(rules, by = "lift"), 20))
#> lhs rhs support confidence coverage lift count
#> [1] {KIDS RAIN MAC BLUE} => {KIDS RAIN MAC PINK} 0.005521635 0.8524590 0.006477303 124.46562 104
#> [2] {KIDS RAIN MAC PINK} => {KIDS RAIN MAC BLUE} 0.005521635 0.8062016 0.006848951 124.46562 104
#> [3] {CHRISTMAS TREE DECORATION WITH BELL,
#> CHRISTMAS TREE HEART DECORATION} => {CHRISTMAS TREE STAR DECORATION} 0.005149987 0.8660714 0.005946376 114.87645 97
#> [4] {CHRISTMAS TREE HEART DECORATION,
#> CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005149987 0.8508772 0.006052562 107.55887 97
#> [5] {CAST IRON HOOK GARDEN FORK} => {CAST IRON HOOK GARDEN TROWEL} 0.007273693 0.8838710 0.008229360 107.40458 137
#> [6] {CAST IRON HOOK GARDEN TROWEL} => {CAST IRON HOOK GARDEN FORK} 0.007273693 0.8838710 0.008229360 107.40458 137
#> [7] {CHRISTMAS TREE DECORATION WITH BELL,
#> CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE HEART DECORATION} 0.005149987 0.8660714 0.005946376 101.95285 97
#> [8] {CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005946376 0.7887324 0.007539156 99.70319 112
#> [9] {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE STAR DECORATION} 0.005946376 0.7516779 0.007910804 99.70319 112
#> [10] {CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE HEART DECORATION} 0.006052562 0.8028169 0.007539156 94.50660 114
#> [11] {CHRISTMAS TREE HEART DECORATION} => {CHRISTMAS TREE STAR DECORATION} 0.006052562 0.7125000 0.008494823 94.50660 114
#> [12] {CHRISTMAS TREE HEART DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005946376 0.7000000 0.008494823 88.48658 112
#> [13] {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE HEART DECORATION} 0.005946376 0.7516779 0.007910804 88.48658 112
#> [14] {CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN FORK PINK,
#> CHILDRENS GARDEN GLOVES BLUE} => {CHILDS GARDEN FORK BLUE} 0.005043801 0.9500000 0.005309265 86.86044 95
#> [15] {CHILDS GARDEN FORK PINK,
#> CHILDRENS GARDEN GLOVES BLUE} => {CHILDS GARDEN FORK BLUE} 0.005362357 0.9439252 0.005680913 86.30501 101
#> [16] {CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN FORK PINK,
#> CHILDRENS GARDEN GLOVES PINK} => {CHILDS GARDEN FORK BLUE} 0.005309265 0.9433962 0.005627821 86.25664 100
#> [17] {CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN TROWEL PINK,
#> CHILDS GARDEN FORK PINK} => {CHILDS GARDEN FORK BLUE} 0.008282453 0.9397590 0.008813379 85.92408 156
#> [18] {CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN FORK PINK} => {CHILDS GARDEN FORK BLUE} 0.008388638 0.9294118 0.009025750 84.97801 158
#> [19] {84559B} => {84559A} 0.005096894 0.6486486 0.007857712 84.84234 96
#> [20] {84559A} => {84559B} 0.005096894 0.6666667 0.007645341 84.84234 96
cat("\nTop 20 by confidence:\n")
#>
#> Top 20 by confidence:
inspect(head(sort(rules, by = "confidence"), 20))
#> lhs rhs support confidence coverage lift count
#> [1] {18098c} => {DOTCOM POSTAGE} 0.006902044 1.0000000 0.006902044 25.69577 130
#> [2] {CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN FORK PINK,
#> CHILDRENS GARDEN GLOVES BLUE} => {CHILDS GARDEN TROWEL PINK} 0.005309265 1.0000000 0.005309265 70.01859 100
#> [3] {CHILDS GARDEN TROWEL BLUE,
#> CHILDRENS GARDEN GLOVES BLUE,
#> CHILDRENS GARDEN GLOVES PINK} => {CHILDS GARDEN TROWEL PINK} 0.006264932 1.0000000 0.006264932 70.01859 118
#> [4] {CHILDS GARDEN TROWEL BLUE,
#> CHILDRENS GARDEN GLOVES PINK} => {CHILDS GARDEN TROWEL PINK} 0.006742766 0.9921875 0.006795859 69.47157 127
#> [5] {CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN FORK PINK} => {CHILDS GARDEN TROWEL PINK} 0.008282453 0.9873418 0.008388638 69.13228 156
#> [6] {CHILDS GARDEN TROWEL PINK,
#> CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN FORK PINK} => {CHILDS GARDEN TROWEL BLUE} 0.008282453 0.9873418 0.008388638 72.36024 156
#> [7] {CHILDS GARDEN TROWEL PINK,
#> CHILDS GARDEN FORK BLUE,
#> CHILDRENS GARDEN GLOVES PINK} => {CHILDS GARDEN TROWEL BLUE} 0.005468543 0.9809524 0.005574728 71.89198 103
#> [8] {STRAWBERRY CERAMIC TRINKET BOX,
#> ENVELOPE 50 ROMANTIC IMAGES} => {DOTCOM POSTAGE} 0.005415450 0.9807692 0.005521635 25.20162 102
#> [9] {PACK OF 72 RETRO SPOT CAKE CASES,
#> CHARLOTTE BAG , PINK/WHITE SPOTS,
#> ANTIQUE SILVER TEA GLASS ETCHED} => {DOTCOM POSTAGE} 0.005415450 0.9807692 0.005521635 25.20162 102
#> [10] {PACK OF 72 RETRO SPOT CAKE CASES,
#> SMALL GLASS HEART TRINKET POT,
#> ANTIQUE SILVER TEA GLASS ETCHED} => {DOTCOM POSTAGE} 0.005203079 0.9800000 0.005309265 25.18186 98
#> [11] {CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN FORK PINK} => {CHILDS GARDEN TROWEL PINK} 0.008813379 0.9764706 0.009025750 68.37109 166
#> [12] {POPPY'S PLAYHOUSE BEDROOM,
#> POPPY'S PLAYHOUSE LIVINGROOM,
#> POPPY'S PLAYHOUSE BATHROOM} => {POPPY'S PLAYHOUSE KITCHEN} 0.008547916 0.9757576 0.008760287 53.73799 161
#> [13] {CHILDS GARDEN TROWEL PINK,
#> CHILDS GARDEN FORK PINK,
#> CHILDRENS GARDEN GLOVES BLUE} => {CHILDS GARDEN TROWEL BLUE} 0.005309265 0.9708738 0.005468543 71.15334 100
#> [14] {CHILDS GARDEN FORK PINK,
#> CHILDRENS GARDEN GLOVES BLUE,
#> CHILDRENS GARDEN GLOVES PINK} => {CHILDS GARDEN TROWEL PINK} 0.005309265 0.9708738 0.005468543 67.97921 100
#> [15] {CHILDS GARDEN TROWEL PINK,
#> CHILDS GARDEN FORK PINK,
#> CHILDRENS GARDEN GLOVES BLUE} => {CHILDRENS GARDEN GLOVES PINK} 0.005309265 0.9708738 0.005468543 68.74589 100
#> [16] {CHILDS GARDEN TROWEL PINK,
#> CHILDS GARDEN FORK BLUE,
#> CHILDRENS GARDEN GLOVES BLUE} => {CHILDS GARDEN TROWEL BLUE} 0.005256172 0.9705882 0.005415450 71.13241 99
#> [17] {CHILDS GARDEN TROWEL PINK,
#> CHILDS GARDEN FORK BLUE} => {CHILDS GARDEN TROWEL BLUE} 0.008707194 0.9704142 0.008972657 71.11966 164
#> [18] {RED SPOTTY CHARLOTTE BAG,
#> STRAWBERRY CERAMIC TRINKET BOX,
#> SMALL GLASS HEART TRINKET POT} => {DOTCOM POSTAGE} 0.005203079 0.9702970 0.005362357 24.93253 98
#> [19] {CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN FORK PINK,
#> CHILDRENS GARDEN GLOVES BLUE} => {CHILDRENS GARDEN GLOVES PINK} 0.005149987 0.9700000 0.005309265 68.68402 97
#> [20] {PACK OF 72 RETRO SPOT CAKE CASES,
#> SWEETHEART CERAMIC TRINKET BOX,
#> ANTIQUE SILVER TEA GLASS ETCHED} => {DOTCOM POSTAGE} 0.005149987 0.9700000 0.005309265 24.92490 97
Top rules by lift expose tightly coupled product pairs (e.g. complementary tools, matched seasonal decorations). High-lift rules are niche in absolute frequency but exceptionally strong in relative terms and well-suited for targeted recommendations. High-confidence rules that also carry high lift are analytically reliable: the pattern is not driven purely by a popular RHS item.
At this stage the rules are already filtered for minimum lift and non-redundancy. To make the output presentation-friendly we focus on rules with a single-item RHS, which are the easiest to communicate and deploy. Because apriori() in arules only generates rules with one item in the consequent, this filter acts as a safeguard and leaves the rule count unchanged at 14957. We then produce three views: highest-lift bundles, highest-confidence predictors, and cross-sell candidates with meaningful commercial scale.
rules_rhs1 <- rules[size(rhs(rules)) == 1]
q <- quality(rules_rhs1)

rules_bundle <- head(sort(rules_rhs1, by = "lift"), 30)
rules_conf <- head(sort(rules_rhs1, by = "confidence"), 30)
rules_xsell <- rules_rhs1[q$support >= 0.01 & q$lift >= 2] %>%
  sort(by = "lift") %>%
  head(30)

write.csv(as(rules_bundle, "data.frame"),
          file.path("data", "rules_top_lift.csv"), row.names = FALSE)
write.csv(as(rules_conf, "data.frame"),
          file.path("data", "rules_top_confidence.csv"), row.names = FALSE)
write.csv(as(rules_xsell, "data.frame"),
          file.path("data", "rules_cross_sell.csv"), row.names = FALSE)

cat(sprintf("Single-RHS rules: %d\n", length(rules_rhs1)))
#> Single-RHS rules: 14957
cat("Top 10 by lift:\n"); inspect(head(rules_bundle, 10))
#> Top 10 by lift:
#> lhs rhs support confidence coverage lift count
#> [1] {KIDS RAIN MAC BLUE} => {KIDS RAIN MAC PINK} 0.005521635 0.8524590 0.006477303 124.46562 104
#> [2] {KIDS RAIN MAC PINK} => {KIDS RAIN MAC BLUE} 0.005521635 0.8062016 0.006848951 124.46562 104
#> [3] {CHRISTMAS TREE DECORATION WITH BELL,
#> CHRISTMAS TREE HEART DECORATION} => {CHRISTMAS TREE STAR DECORATION} 0.005149987 0.8660714 0.005946376 114.87645 97
#> [4] {CHRISTMAS TREE HEART DECORATION,
#> CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005149987 0.8508772 0.006052562 107.55887 97
#> [5] {CAST IRON HOOK GARDEN FORK} => {CAST IRON HOOK GARDEN TROWEL} 0.007273693 0.8838710 0.008229360 107.40458 137
#> [6] {CAST IRON HOOK GARDEN TROWEL} => {CAST IRON HOOK GARDEN FORK} 0.007273693 0.8838710 0.008229360 107.40458 137
#> [7] {CHRISTMAS TREE DECORATION WITH BELL,
#> CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE HEART DECORATION} 0.005149987 0.8660714 0.005946376 101.95285 97
#> [8] {CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005946376 0.7887324 0.007539156 99.70319 112
#> [9] {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE STAR DECORATION} 0.005946376 0.7516779 0.007910804 99.70319 112
#> [10] {CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE HEART DECORATION} 0.006052562 0.8028169 0.007539156 94.50660 114
cat("\nTop 10 by confidence:\n"); inspect(head(rules_conf, 10))
#>
#> Top 10 by confidence:
#> lhs rhs support confidence coverage lift count
#> [1] {18098c} => {DOTCOM POSTAGE} 0.006902044 1.0000000 0.006902044 25.69577 130
#> [2] {CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN FORK PINK,
#> CHILDRENS GARDEN GLOVES BLUE} => {CHILDS GARDEN TROWEL PINK} 0.005309265 1.0000000 0.005309265 70.01859 100
#> [3] {CHILDS GARDEN TROWEL BLUE,
#> CHILDRENS GARDEN GLOVES BLUE,
#> CHILDRENS GARDEN GLOVES PINK} => {CHILDS GARDEN TROWEL PINK} 0.006264932 1.0000000 0.006264932 70.01859 118
#> [4] {CHILDS GARDEN TROWEL BLUE,
#> CHILDRENS GARDEN GLOVES PINK} => {CHILDS GARDEN TROWEL PINK} 0.006742766 0.9921875 0.006795859 69.47157 127
#> [5] {CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN FORK PINK} => {CHILDS GARDEN TROWEL PINK} 0.008282453 0.9873418 0.008388638 69.13228 156
#> [6] {CHILDS GARDEN TROWEL PINK,
#> CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN FORK PINK} => {CHILDS GARDEN TROWEL BLUE} 0.008282453 0.9873418 0.008388638 72.36024 156
#> [7] {CHILDS GARDEN TROWEL PINK,
#> CHILDS GARDEN FORK BLUE,
#> CHILDRENS GARDEN GLOVES PINK} => {CHILDS GARDEN TROWEL BLUE} 0.005468543 0.9809524 0.005574728 71.89198 103
#> [8] {STRAWBERRY CERAMIC TRINKET BOX,
#> ENVELOPE 50 ROMANTIC IMAGES} => {DOTCOM POSTAGE} 0.005415450 0.9807692 0.005521635 25.20162 102
#> [9] {PACK OF 72 RETRO SPOT CAKE CASES,
#> CHARLOTTE BAG , PINK/WHITE SPOTS,
#> ANTIQUE SILVER TEA GLASS ETCHED} => {DOTCOM POSTAGE} 0.005415450 0.9807692 0.005521635 25.20162 102
#> [10] {PACK OF 72 RETRO SPOT CAKE CASES,
#> SMALL GLASS HEART TRINKET POT,
#> ANTIQUE SILVER TEA GLASS ETCHED} => {DOTCOM POSTAGE} 0.005203079 0.9800000 0.005309265 25.18186 98
cat("\nTop 10 cross-sell (supp ≥ 0.01, lift ≥ 2):\n"); inspect(head(rules_xsell, 10))
#>
#> Top 10 cross-sell (supp ≥ 0.01, lift ≥ 2):
#> lhs rhs support confidence coverage lift count
#> [1] {CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN FORK BLUE} 0.01008760 0.7392996 0.01364481 67.59567 190
#> [2] {CHILDS GARDEN FORK BLUE} => {CHILDS GARDEN TROWEL BLUE} 0.01008760 0.9223301 0.01093709 67.59567 190
#> [3] {CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN FORK PINK} 0.01083090 0.7583643 0.01428192 65.22279 204
#> [4] {CHILDS GARDEN FORK PINK} => {CHILDS GARDEN TROWEL PINK} 0.01083090 0.9315068 0.01162729 65.22279 204
#> [5] {CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN TROWEL PINK} 0.01130873 0.8287938 0.01364481 58.03097 213
#> [6] {CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN TROWEL BLUE} 0.01130873 0.7918216 0.01428192 58.03097 213
#> [7] {POPPY'S PLAYHOUSE BEDROOM,
#> POPPY'S PLAYHOUSE KITCHEN} => {POPPY'S PLAYHOUSE LIVINGROOM} 0.01093709 0.7518248 0.01454739 54.25525 206
#> [8] {POPPY'S PLAYHOUSE LIVINGROOM,
#> POPPY'S PLAYHOUSE KITCHEN} => {POPPY'S PLAYHOUSE BEDROOM} 0.01093709 0.8803419 0.01242368 52.14226 206
#> [9] {POPPY'S PLAYHOUSE LIVINGROOM} => {POPPY'S PLAYHOUSE BEDROOM} 0.01189275 0.8582375 0.01385718 50.83303 224
#> [10] {POPPY'S PLAYHOUSE BEDROOM} => {POPPY'S PLAYHOUSE LIVINGROOM} 0.01189275 0.7044025 0.01688346 50.83303 224
How to read a rule row:
support - fraction of all retained baskets containing both LHS and RHS;
confidence - how often RHS occurs when LHS is present;
coverage - fraction of all retained baskets containing the LHS;
lift - confidence divided by the overall support of the RHS, i.e. strength relative to random co-occurrence;
count - number of baskets containing both LHS and RHS.
A cross-sell support threshold (≥ 1%) ensures the pattern occurs at non-trivial scale.
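As a worked check of these definitions, the sketch below recomputes the first row of the top-lift table, {KIDS RAIN MAC BLUE} => {KIDS RAIN MAC PINK}, from figures printed in this report. The RHS support is taken from the coverage of the reverse rule, since that rule's LHS is the PINK item. Variable names are illustrative.

```r
n_baskets  <- 18835        # transactions reported by summary(trans)
count_both <- 104          # baskets containing both items (count column)
supp_lhs   <- 0.006477303  # coverage of {KIDS RAIN MAC BLUE} => {KIDS RAIN MAC PINK}
supp_rhs   <- 0.006848951  # coverage of the reverse rule = support of PINK

support    <- count_both / n_baskets  # joint support
confidence <- support / supp_lhs      # share of LHS baskets that also hold RHS
lift       <- confidence / supp_rhs   # confidence relative to the RHS base rate

c(support = support, confidence = confidence, lift = lift)
```

Up to rounding of the printed inputs, the results agree with the tabulated support 0.005521635, confidence 0.8524590, and lift 124.46562.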
For a compact, deployment-ready rule set we restrict attention to short antecedents (≤ 2 items), keep a single-item RHS, remove operational artifacts (shipping / fees), and enforce balanced support / confidence / lift thresholds. The objective is to avoid both ultra-rare niche bundles and rules that are statistically strong only because they point to globally popular items.
rules_act <- rules[
  size(lhs(rules)) <= 2 &
    size(rhs(rules)) == 1
]
rules_act <- rules_act[!(lhs(rules_act) %pin% DROP_TERMS)]
rules_act <- rules_act[!(rhs(rules_act) %pin% DROP_TERMS)]

q_act <- quality(rules_act)
rules_act <- rules_act[
  q_act$support >= SUPP_ACT &
    q_act$confidence >= CONF_ACT &
    q_act$lift >= LIFT_LO &
    q_act$lift <= LIFT_HI
]

rules_top50 <- head(sort(rules_act, by = "lift"), 50)
cat(sprintf("Actionable candidate rules: %d\n", length(rules_act)))
#> Actionable candidate rules: 280
write.csv(as(rules_top50, "data.frame"),
          file.path("data", "rules_actionable_top50.csv"), row.names = FALSE)
saveRDS(rules_top50, file.path("data", "rules_actionable_top50.rds"))
inspect(head(rules_top50, 20))
#> lhs rhs support confidence coverage lift count
#> [1] {WOODLAND CHARLOTTE BAG,
#> RED SPOTTY CHARLOTTE BAG} => {CHARLOTTE BAG , PINK/WHITE SPOTS} 0.01555615 0.6720183 0.02314839 14.64984 293
#> [2] {RED SPOTTY CHARLOTTE BAG,
#> CHARLOTTE BAG , PINK/WHITE SPOTS} => {STRAWBERRY CHARLOTTE BAG} 0.01598089 0.5271454 0.03031590 14.64422 301
#> [3] {PINK BLUE FELT CRAFT TRINKET BOX} => {PINK CREAM FELT CRAFT TRINKET BOX} 0.02129015 0.6159754 0.03456331 14.28805 401
#> [4] {PINK CREAM FELT CRAFT TRINKET BOX} => {PINK BLUE FELT CRAFT TRINKET BOX} 0.02129015 0.4938424 0.04311123 14.28805 401
#> [5] {WOODLAND CHARLOTTE BAG,
#> CHARLOTTE BAG , PINK/WHITE SPOTS} => {RED SPOTTY CHARLOTTE BAG} 0.01555615 0.7751323 0.02006902 14.11955 293
#> [6] {PLASTERS IN TIN SPACEBOY} => {PLASTERS IN TIN WOODLAND ANIMALS} 0.01582161 0.4467766 0.03541280 13.95529 298
#> [7] {PLASTERS IN TIN WOODLAND ANIMALS} => {PLASTERS IN TIN SPACEBOY} 0.01582161 0.4941957 0.03201487 13.95529 298
#> [8] {CHARLOTTE BAG SUKI DESIGN,
#> CHARLOTTE BAG , PINK/WHITE SPOTS} => {RED SPOTTY CHARLOTTE BAG} 0.01736130 0.7500000 0.02314839 13.66175 327
#> [9] {PLASTERS IN TIN SPACEBOY} => {PLASTERS IN TIN CIRCUS PARADE} 0.01640563 0.4632684 0.03541280 13.38291 309
#> [10] {PLASTERS IN TIN CIRCUS PARADE} => {PLASTERS IN TIN SPACEBOY} 0.01640563 0.4739264 0.03461641 13.38291 309
#> [11] {84970L} => {84970S} 0.02139634 0.6287051 0.03403239 13.08471 403
#> [12] {84970S} => {84970L} 0.02139634 0.4453039 0.04804885 13.08471 403
#> [13] {LARGE POPCORN HOLDER} => {SMALL POPCORN HOLDER} 0.02139634 0.6773109 0.03159012 13.05747 403
#> [14] {SMALL POPCORN HOLDER} => {LARGE POPCORN HOLDER} 0.02139634 0.4124872 0.05187152 13.05747 403
#> [15] {RETRO SPOT LARGE MILK JUG} => {RED RETROSPOT SMALL MILK JUG} 0.01730820 0.5182830 0.03339527 12.69423 326
#> [16] {RED RETROSPOT SMALL MILK JUG} => {RETRO SPOT LARGE MILK JUG} 0.01730820 0.4239272 0.04082825 12.69423 326
#> [17] {RED SPOTTY CHARLOTTE BAG,
#> CHARLOTTE BAG , PINK/WHITE SPOTS} => {WOODLAND CHARLOTTE BAG} 0.01555615 0.5131349 0.03031590 12.65039 293
#> [18] {TOY TIDY PINK RETROSPOT} => {RECYCLING BAG RETROSPOT} 0.01927263 0.5671875 0.03397929 12.64258 363
#> [19] {RECYCLING BAG RETROSPOT} => {TOY TIDY PINK RETROSPOT} 0.01927263 0.4295858 0.04486329 12.64258 363
#> [20] {RED SPOTTY CHARLOTTE BAG,
#> CHARLOTTE BAG , PINK/WHITE SPOTS} => {CHARLOTTE BAG SUKI DESIGN} 0.01736130 0.5726795 0.03031590 12.16056 327
The shortlist balances non-trivial support with high relative strength (lift bounded below by 1.5 and above by 15). The upper lift cap removes variant-completion patterns such as tightly matched sets or product variants sold together, which are statistically strong but often weak from a recommendation-design perspective.
plot(rules_top50,
     measure = c("support", "confidence"),
     shading = "lift")
Most rules cluster at low support values, with the highest-lift rules appearing as isolated points. This is consistent with the empirical structure of retail baskets: strong associations tend to be localized, and the lift shading highlights rules that are genuinely stronger than baseline item popularity would predict.
set.seed(42)
rules_net <- head(sort(rules_top50, by = "lift"), 20)
plot(rules_net, method = "graph", engine = "htmlwidget")
The graph reveals product clusters (e.g. tightly coupled variants and seasonal bundles) and highlights hub items that frequently appear as RHS. Such hubs represent natural anchor products for recommendation placement and bundle pricing strategies. A fixed seed is set so that the force-directed layout is intended to remain stable across renders, which makes side-by-side comparison and review easier.
Purchasing behaviour in the Online Retail II dataset is strongly structured rather than random. By using StockCode as the underlying item identifier, excluding one-item baskets, profiling temporal concentration, and applying stricter lift-based filtering, the analysis is less exposed to noise from description variants, uninformative single-item baskets, and near-random co-occurrences. The resulting rule sets are therefore more interpretable, more robust, and closer to what could plausibly support recommendation, bundling, or merchandising decisions.
required_pkgs <- c("readxl", "dplyr", "stringr", "lubridate", "arules", "arulesViz")
pkg_versions <- tibble::tibble(
  package = c("R", required_pkgs),
  version = c(
    as.character(getRversion()),
    vapply(required_pkgs, function(p) as.character(packageVersion(p)), character(1))
  )
)
knitr::kable(pkg_versions, caption = "R and package versions used in this run")

| package | version |
|---|---|
| R | 4.5.2 |
| readxl | 1.4.5 |
| dplyr | 1.1.4 |
| stringr | 1.6.0 |
| lubridate | 1.9.5 |
| arules | 1.7.13 |
| arulesViz | 1.5.4 |
sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: Europe/Warsaw
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] arulesViz_1.5.4 arules_1.7.13 Matrix_1.7-4 lubridate_1.9.5
#> [5] stringr_1.6.0 dplyr_1.1.4 readxl_1.4.5
#>
#> loaded via a namespace (and not attached):
#> [1] viridis_0.6.5 utf8_1.2.6 sass_0.4.10 generics_0.1.4
#> [5] tidyr_1.3.2 stringi_1.8.7 lattice_0.22-7 digest_0.6.39
#> [9] magrittr_2.0.4 evaluate_1.0.5 grid_4.5.2 timechange_0.4.0
#> [13] RColorBrewer_1.1-3 fastmap_1.2.0 cellranger_1.1.0 jsonlite_2.0.0
#> [17] ggrepel_0.9.7 gridExtra_2.3 purrr_1.2.1 viridisLite_0.4.3
#> [21] scales_1.4.0 tweenr_2.0.3 jquerylib_0.1.4 cli_3.6.5
#> [25] graphlayouts_1.2.3 rlang_1.1.6 polyclip_1.10-7 visNetwork_2.1.4
#> [29] tidygraph_1.3.1 withr_3.0.2 cachem_1.1.0 yaml_2.3.12
#> [33] otel_0.2.0 tools_4.5.2 memoise_2.0.1 ggplot2_4.0.2
#> [37] vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.5 htmlwidgets_1.6.4
#> [41] MASS_7.3-65 ggraph_2.2.2 pkgconfig_2.0.3 pillar_1.11.1
#> [45] bslib_0.10.0 gtable_0.3.6 Rcpp_1.1.0 glue_1.8.0
#> [49] ggforce_0.5.0 xfun_0.56 tibble_3.3.0 tidyselect_1.2.1
#> [53] rstudioapi_0.18.0 knitr_1.51 farver_2.1.2 htmltools_0.5.9
#> [57] igraph_2.2.2 labeling_0.4.3 rmarkdown_2.30 compiler_4.5.2
#> [61] S7_0.2.1