1 Introduction

This report presents a complete, end-to-end implementation of association rules mining applied to the Online Retail II dataset. The source data are stored at invoice-line level, where each row represents a single product occurrence within an invoice rather than a market basket directly.

For mining, invoices are transformed into binary baskets. The key methodological choice is to build baskets from StockCode, which is a more stable product identifier than free-text descriptions. Canonical descriptions are reattached only for readability in the final outputs. Single-item invoices are also removed, because they cannot generate association rules and would otherwise dilute support estimates.

In addition to standard basket-size profiling, the report includes a light temporal profile by month and day of week. This is not sequential pattern mining, but it provides context for seasonality and helps interpret why some supports are structurally higher in specific periods.

The analysis focuses on discovering frequent itemsets and association rules that capture systematic purchasing relationships. To evaluate the strength and relevance of the extracted rules, we rely on three standard metrics: support, confidence, and lift.
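
As a minimal sketch of how these metrics are computed for a rule LHS => RHS, the example below uses five hypothetical toy baskets (not drawn from the Online Retail II data). The helper `support()` and the item names are assumptions made for illustration only. Support is the share of baskets containing every item of the rule, confidence is the share of LHS baskets that also contain the RHS, and lift is confidence divided by the baseline support of the RHS.

```r
# Hypothetical toy baskets, for illustration only.
toy_baskets <- list(
  c("tea", "cake"),
  c("tea", "cake", "napkins"),
  c("tea"),
  c("cake", "napkins"),
  c("tea", "cake")
)

# Share of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "tea"
rhs <- "cake"

supp_rule <- support(c(lhs, rhs), toy_baskets)      # P(LHS and RHS)
conf_rule <- supp_rule / support(lhs, toy_baskets)  # P(RHS | LHS)
lift_rule <- conf_rule / support(rhs, toy_baskets)  # P(RHS | LHS) / P(RHS)

cat(sprintf("support = %.3f | confidence = %.3f | lift = %.4f\n",
            supp_rule, conf_rule, lift_rule))
```

A lift below 1, as in this toy case, means the two items co-occur slightly less often than independence would predict; values well above 1 indicate a positive association.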

1.1 Dataset

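The Online Retail II dataset is publicly available from the UCI Machine Learning Repository. It contains transactional data from a UK-based online retailer selling unique all-occasion giftware, covering all invoice-line transactions recorded between 01/12/2009 and 09/12/2011. The workbook is read in Section 2 with read_excel() and no sheet argument, which loads only the first worksheet; the results in this report therefore cover December 2009 to December 2010, as the monthly profile in Section 3.3 confirms.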

Data source: https://archive.ics.uci.edu/dataset/502/online+retail+ii

2 Load and clean

We keep only completed sales lines: rows with a missing invoice number or product description are dropped, only positive quantities are kept, and credit notes / returns (invoices starting with “C”) are removed. Product descriptions are uppercased and whitespace-normalised to stabilise reporting labels, while StockCode is retained as the primary product identifier used later for basket construction. InvoiceDate is preserved for temporal profiling.

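library(readxl)     # read_excel()
library(dplyr)      # filter(), mutate(), count(), ...
library(stringr)    # str_detect(), str_squish(), str_to_upper()
library(lubridate)  # wday()
library(arules)     # transactions, apriori(), eclat()

# Thresholds used throughout; values match the parameters echoed in the
# mining logs and quoted in the text of Sections 4 and 5.
MIN_BASKET   <- 2
SUPP_FIS     <- 0.01
MAXLEN_FIS   <- 3
SUPP_RULES   <- 0.005
CONF_RULES   <- 0.2
MAXLEN_RULES <- 4
LIFT_MIN     <- 1.5

# read_excel() loads only the first worksheet unless `sheet` is given.
raw <- read_excel("online_retail_II.xlsx")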

df <- raw %>%
  filter(
    !is.na(Invoice),
    !is.na(Description),
    Quantity > 0,
    !str_detect(as.character(Invoice), "^C")
  ) %>%
  mutate(
    Invoice       = as.character(Invoice),
    StockCode     = as.character(StockCode),
    Description   = str_squish(str_to_upper(Description)),
    `Customer ID` = as.integer(`Customer ID`),
    Country       = as.character(Country)
  ) %>%
  select(Invoice, StockCode, Description, Quantity, InvoiceDate, Price,
         `Customer ID`, Country)

dir.create("data", showWarnings = FALSE)  # make sure the output folder exists
saveRDS(df, file.path("data", "online_retail_clean.rds"))

# "Items" counts distinct descriptions here; baskets are later keyed by StockCode.
cat(sprintf("Rows: %d  |  Invoices: %d  |  Items: %d\n",
            nrow(df), n_distinct(df$Invoice), n_distinct(df$Description)))
#> Rows: 512033  |  Invoices: 21002  |  Items: 4515

3 Data profile

3.1 Top items by line frequency

df %>% count(Description, sort = TRUE) %>% slice_head(n = 20)
#> # A tibble: 20 × 2
#>    Description                            n
#>    <chr>                              <int>
#>  1 WHITE HANGING HEART T-LIGHT HOLDER  3456
#>  2 REGENCY CAKESTAND 3 TIER            2046
#>  3 STRAWBERRY CERAMIC TRINKET BOX      1714
#>  4 PACK OF 72 RETRO SPOT CAKE CASES    1456
#>  5 ASSORTED COLOUR BIRD ORNAMENT       1450
#>  6 60 TEATIME FAIRY CAKE CASES         1394
#>  7 HOME BUILDING BLOCK WORD            1376
#>  8 JUMBO BAG RED RETROSPOT             1280
#>  9 LUNCH BAG RED SPOTTY                1246
#> 10 REX CASH+CARRY JUMBO SHOPPER        1226
#> 11 JUMBO STORAGE BAG SUKI              1203
#> 12 PACK OF 60 PINK PAISLEY CAKE CASES  1191
#> 13 WOODEN FRAME ANTIQUE WHITE          1169
#> 14 LUNCH BAG BLACK SKULL.              1156
#> 15 LUNCH BAG SUKI DESIGN               1146
#> 16 HEART OF WICKER LARGE               1145
#> 17 LOVE BUILDING BLOCK WORD            1129
#> 18 RED HANGING HEART T-LIGHT HOLDER    1106
#> 19 JUMBO SHOPPER VINTAGE RED PAISLEY   1085
#> 20 JUMBO BAG STRAWBERRY                1078

The top-20 frequency table reveals strong demand concentration. Items such as WHITE HANGING HEART T-LIGHT HOLDER or REGENCY CAKESTAND 3 TIER appear in thousands of invoice lines. Rules involving globally popular items - especially on the RHS - can achieve high confidence even without strong conditional dependence, reinforcing the importance of interpreting confidence jointly with lift.
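
To make the last point concrete, the sketch below uses hypothetical supports (not values from this dataset) for a niche LHS item and a very popular RHS item that are statistically independent. Under that assumption confidence simply equals the RHS support, while lift stays at 1.

```r
# Hypothetical marginal supports, chosen for illustration only.
supp_lhs <- 0.02   # niche antecedent
supp_rhs <- 0.60   # very popular consequent

# Under independence the joint support is the product of the marginals.
supp_both <- supp_lhs * supp_rhs

confidence <- supp_both / supp_lhs   # equals supp_rhs
lift       <- confidence / supp_rhs  # equals 1: no association

cat(sprintf("confidence = %.2f | lift = %.2f\n", confidence, lift))
```

A confidence threshold alone would keep such a rule, whereas a lift filter would discard it.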

3.2 Invoice-size distribution

df %>%
  count(Invoice, name = "n") %>%
  summarise(mean = mean(n), median = median(n),
            p90  = quantile(n, .9), p99 = quantile(n, .99))
#> # A tibble: 1 × 4
#>    mean median   p90   p99
#>   <dbl>  <dbl> <dbl> <dbl>
#> 1  24.4     15    51   180

The distribution has a heavy right tail: a small number of very large invoices can generate a disproportionate number of co-occurrences and inflate higher-order patterns. This motivates the maxlen constraints and conservative support thresholds used in the mining stage.
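
The effect of the right tail can be quantified with a short calculation. The sketch below takes the median, p90 and p99 invoice sizes from the summary above and counts how many distinct item pairs and triples a single invoice of that size can form. It assumes every line is a different product, so the figures are upper bounds.

```r
# Invoice sizes taken from the summary above (median, p90, p99).
invoice_size <- c(median = 15, p90 = 51, p99 = 180)

# Number of distinct item pairs and triples one invoice of that size can form,
# assuming all lines refer to different products.
n_pairs   <- choose(invoice_size, 2)
n_triples <- choose(invoice_size, 3)

print(data.frame(invoice_size, n_pairs, n_triples))
```

A p99 invoice can form 16110 pairs against 105 for the median invoice, which is why a handful of very large baskets can dominate higher-order co-occurrence counts.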

3.3 Temporal profile

monthly_profile <- df %>%
  mutate(month = format(InvoiceDate, "%Y-%m")) %>%
  distinct(Invoice, month) %>%
  count(month, name = "n_invoices") %>%
  arrange(month)

weekday_profile <- df %>%
  mutate(weekday = wday(InvoiceDate, label = TRUE, week_start = 1)) %>%
  distinct(Invoice, weekday) %>%
  count(weekday, name = "n_invoices")

monthly_profile
#> # A tibble: 13 × 2
#>    month   n_invoices
#>    <chr>        <int>
#>  1 2009-12       1682
#>  2 2010-01       1106
#>  3 2010-02       1203
#>  4 2010-03       1687
#>  5 2010-04       1465
#>  6 2010-05       1504
#>  7 2010-06       1652
#>  8 2010-07       1535
#>  9 2010-08       1427
#> 10 2010-09       1845
#> 11 2010-10       2304
#> 12 2010-11       2755
#> 13 2010-12        837
weekday_profile
#> # A tibble: 7 × 2
#>   weekday n_invoices
#>   <ord>        <int>
#> 1 Mon           3331
#> 2 Tue           3830
#> 3 Wed           3744
#> 4 Thu           4306
#> 5 Fri           3042
#> 6 Sat             30
#> 7 Sun           2719

Monthly invoice counts provide a quick check for holiday concentration and broader demand regimes, while weekday counts expose the retailer’s operating rhythm. This matters because support is not purely a product property; it is partly shaped by when the store is active and how concentrated demand is across the calendar.
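
A rough sketch of this calendar effect, using the monthly invoice counts printed above: the hypothetical seasonal item (300 invoices, all in November 2010) is an assumption for illustration, and the counts include single-item invoices that are later removed from the baskets.

```r
# Invoice counts per month, copied from the monthly profile above.
n_invoices <- c(
  "2009-12" = 1682, "2010-01" = 1106, "2010-02" = 1203, "2010-03" = 1687,
  "2010-04" = 1465, "2010-05" = 1504, "2010-06" = 1652, "2010-07" = 1535,
  "2010-08" = 1427, "2010-09" = 1845, "2010-10" = 2304, "2010-11" = 2755,
  "2010-12" = 837
)

# Share of all invoices that falls in each month.
month_share <- n_invoices / sum(n_invoices)
print(round(sort(month_share, decreasing = TRUE)[1:3], 3))

# Hypothetical seasonal item: bought in 300 invoices, all in November 2010.
seasonal_count   <- 300
overall_support  <- seasonal_count / sum(n_invoices)
november_support <- seasonal_count / n_invoices[["2010-11"]]

cat(sprintf("overall support = %.4f | support within November = %.4f\n",
            overall_support, november_support))
```

The same item looks rare over the whole period but common within its season, which is the sense in which support depends on the calendar as well as on the product.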

4 Build transactions

Association rules operate on binary baskets: an item is either present or absent in an invoice, regardless of quantity. We therefore de-duplicate (Invoice, StockCode) pairs before building the transaction matrix. StockCode gives a stable item identity, while descriptions are mapped back afterwards to keep the report readable. We also discard baskets with fewer than 2 items, because they cannot contribute to rule mining.
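
Before the full pipeline, a toy illustration of the binarisation step in base R may help. The invoice lines, codes and quantities below are hypothetical; the sketch only shows that repeated (Invoice, StockCode) lines and quantities collapse to a single presence flag, and that baskets below the minimum size are dropped.

```r
# Hypothetical invoice lines: invoice A lists product X1 twice.
toy_lines <- data.frame(
  Invoice   = c("A", "A", "A", "B", "B", "C"),
  StockCode = c("X1", "X1", "X2", "X2", "X3", "X1"),
  Quantity  = c(2, 5, 1, 3, 1, 4)
)

# Quantity is ignored: keep one row per (Invoice, StockCode) pair.
toy_pairs <- unique(toy_lines[, c("Invoice", "StockCode")])

# Logical incidence matrix: rows are invoices, columns are products.
incidence <- unclass(table(toy_pairs$Invoice, toy_pairs$StockCode)) > 0

# Drop baskets with fewer than two distinct items.
multi_item <- incidence[rowSums(incidence) >= 2, , drop = FALSE]
print(multi_item)
```

Invoice C contains a single product and is removed, mirroring the basket-size filter applied to the real transactions.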

item_labels <- df %>%
  count(StockCode, Description, sort = TRUE) %>%
  distinct(StockCode, .keep_all = TRUE) %>%
  select(StockCode, Description)

# Only substitute description when it maps to exactly one StockCode;
# shared descriptions (e.g. product variants) keep the StockCode as label.
unique_descs <- item_labels %>%
  count(Description) %>%
  filter(n == 1) %>%
  pull(Description)

item_labels <- item_labels %>%
  mutate(label = if_else(Description %in% unique_descs, Description, StockCode))

baskets <- df %>%
  distinct(Invoice, StockCode) %>%
  group_by(Invoice) %>%
  summarise(items = list(StockCode), .groups = "drop")

trans <- as(baskets$items, "transactions")
label_map         <- setNames(item_labels$label, item_labels$StockCode)
current_labels    <- itemLabels(trans)
itemLabels(trans) <- label_map[current_labels]
trans             <- trans[size(trans) >= MIN_BASKET]

saveRDS(trans, file.path("data", "online_retail_transactions.rds"))

summary(trans)
#> transactions as itemMatrix in sparse format with
#>  18835 rows (elements/itemsets/transactions) and
#>  4252 columns (items) and a density of 0.006201289 
#> 
#> most frequent items:
#>                           85123A         REGENCY CAKESTAND 3 TIER 
#>                             3246                             1986 
#>                           85099B PACK OF 72 RETRO SPOT CAKE CASES 
#>                             1950                             1851 
#>   STRAWBERRY CERAMIC TRINKET BOX                          (Other) 
#>                             1636                           485970 
#> 
#> element (itemset/transaction) length distribution:
#> sizes
#>   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21 
#> 876 726 682 647 617 593 550 610 543 574 560 553 526 542 520 513 527 494 493 440 
#>  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41 
#> 397 369 323 312 292 269 261 248 238 204 178 174 160 153 151 147 131 139 155 136 
#>  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61 
#> 110 109 102  90  83 112  85  83  83  78  78  60  54  63  59  61  56  45  48  42 
#>  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81 
#>  46  33  41  44  28  23  27  26  33  36  21  24  26  22  17  14  17  17  17  14 
#>  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 
#>  10  13  18  11   9  13  13  19  17   6   6   9   8  13   8   9   8   9  13   9 
#> 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 
#>   3   7   7   8   6   8   8   7   8   7   3   9   8  10  11   8   6   4   4   8 
#> 122 123 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 
#>   4   7   5   3   7   2   4   5   3   7   8   6  10   4   6   4   3   2   2   4 
#> 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 
#>   5   2   4   6   4   6   3   1   6   3   3   4   4   4   2   5   2   2   4   6 
#> 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 180 181 182 183 
#>   3   3   6   4   3   4   3   2   2   1   2   4   1   4   1   3   2   4   2   2 
#> 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 
#>   4   1   2   3   1   2   3   2   3   3   2   2   1   2   1   1   4   4   3   2 
#> 204 205 206 207 211 212 213 216 217 218 219 220 221 222 223 224 225 228 229 233 
#>   1   2   2   2   1   1   1   2   1   5   2   3   2   1   2   2   1   2   1   2 
#> 237 238 240 241 242 243 245 248 249 250 253 254 255 257 261 263 264 266 267 268 
#>   1   1   2   1   1   1   1   1   2   2   1   1   1   1   1   1   1   1   2   1 
#> 271 272 274 275 276 279 284 285 295 296 299 307 315 316 320 323 325 332 335 337 
#>   1   2   1   1   2   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1 
#> 340 341 342 343 344 355 358 363 368 369 372 376 379 384 400 407 412 416 420 425 
#>   1   1   1   1   1   1   1   1   2   1   1   1   1   1   2   1   1   1   1   1 
#> 427 429 436 438 439 441 447 449 459 460 463 465 466 467 476 479 480 481 485 486 
#>   1   2   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
#> 497 498 499 501 507 514 516 523 536 545 546 557 568 577 578 586 589 590 595 601 
#>   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1 
#> 648 674 
#>   1   1 
#> 
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    2.00    9.00   17.00   26.37   30.00  674.00 
#> 
#> includes extended item information - examples:
#>                       labels
#> 1 INFLATABLE POLITICAL GLOBE
#> 2      ROBOT PENCIL SHARPNER
#> 3   GROOVY CACTUS INFLATABLE

The resulting transaction matrix remains highly sparse (density of about 0.0062), which is typical for retail data. Most potential item combinations never occur, so strong rules are rare by construction. Filtering out single-item baskets removes observations that carry no rule information, so that support is measured only over baskets that can actually produce a rule.
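
The density reported by summary(trans) can be cross-checked from the other figures in the same output. The sketch below only re-uses numbers printed above; the variable names are illustrative.

```r
# Figures copied from summary(trans) above.
n_baskets   <- 18835
n_items     <- 4252
mean_size   <- 26.37
density_out <- 0.006201289

# Density = filled cells / all cells = mean basket size / number of item columns.
density_check <- mean_size / n_items

# Implied number of (basket, item) incidences in the sparse matrix.
n_incidences <- density_out * n_baskets * n_items

# Sum of the "most frequent items" counts, including the (Other) bucket.
freq_total <- 3246 + 1986 + 1950 + 1851 + 1636 + 485970

cat(sprintf("density check = %.6f | incidences = %.0f | item-count total = %d\n",
            density_check, n_incidences, as.integer(freq_total)))
```

Both routes agree with the printed density up to rounding: on average only about 26 of the 4252 item columns are filled in a basket.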

4.1 Item frequencies

itemFrequencyPlot(trans, topN = 20, type = "absolute")

This plot shows the most frequent items by transaction count (true support counts). Very frequent items can dominate high-confidence rules (many baskets contain them), which is why lift is essential for interpreting rule quality.

5 Mine frequent itemsets and rules

5.1 Frequent itemsets - Apriori vs Eclat

Both algorithms use the same thresholds (supp = 0.01, maxlen = 3). In this report, Apriori and Eclat serve as an internal consistency check rather than as two separate analytical outputs: matching itemset counts are strong evidence that the discovered frequent itemsets do not depend on the implementation, although equal counts alone do not prove that the two sets are identical.
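
A count comparison is a necessary but not a sufficient check. A stricter variant compares the itemsets themselves, as in the base-R sketch below. The helper `canonical_keys()` and the toy itemsets are hypothetical; for arules objects the input lists could presumably be obtained with as(items(fis_apriori), "list"), which is an assumption about how one might wire this up, not part of the original pipeline.

```r
# Map each itemset to an order-independent key, then sort the keys.
canonical_keys <- function(itemsets) {
  sort(vapply(itemsets,
              function(s) paste(sort(s), collapse = " | "),
              character(1)))
}

# Hypothetical outputs of two miners: same itemsets, different ordering.
sets_a <- list(c("TROWEL", "FORK"), "GLOVES", c("FORK", "GLOVES"))
sets_b <- list("GLOVES", c("GLOVES", "FORK"), c("FORK", "TROWEL"))
# Same number of itemsets as sets_a, but one itemset differs.
sets_c <- list("GLOVES", c("GLOVES", "FORK"), c("FORK", "BUCKET"))

same_content_ab <- identical(canonical_keys(sets_a), canonical_keys(sets_b))
same_count_ac   <- length(sets_a) == length(sets_c)
same_content_ac <- identical(canonical_keys(sets_a), canonical_keys(sets_c))

cat(sprintf("a vs b identical: %s | a vs c same count: %s, identical: %s\n",
            same_content_ab, same_count_ac, same_content_ac))
```

The comparison of sets_a and sets_c shows the case that a length check would miss: the counts match while the contents differ.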

fis_apriori <- apriori(trans, parameter = list(
  target = "frequent itemsets", supp = SUPP_FIS, maxlen = MAXLEN_FIS
))
#> Apriori
#> 
#> Parameter specification:
#>  confidence minval smax arem  aval originalSupport maxtime support minlen
#>          NA    0.1    1 none FALSE            TRUE       5    0.01      1
#>  maxlen            target  ext
#>       3 frequent itemsets TRUE
#> 
#> Algorithmic control:
#>  filter tree heap memopt load sort verbose
#>     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
#> 
#> Absolute minimum support count: 188 
#> 
#> set item appearances ...[0 item(s)] done [0.00s].
#> set transactions ...[4238 item(s), 18835 transaction(s)] done [0.10s].
#> sorting and recoding items ... [779 item(s)] done [0.01s].
#> creating transaction tree ... done [0.00s].
#> checking subsets of size 1 2 3
#>  done [0.02s].
#> sorting transactions ... done [0.00s].
#> writing ... [1699 set(s)] done [0.00s].
#> creating S4 object  ... done [0.00s].

fis_eclat <- eclat(trans, parameter = list(
  supp = SUPP_FIS, maxlen = MAXLEN_FIS
))
#> Eclat
#> 
#> parameter specification:
#>  tidLists support minlen maxlen            target  ext
#>     FALSE    0.01      1      3 frequent itemsets TRUE
#> 
#> algorithmic control:
#>  sparse sort verbose
#>       7   -2    TRUE
#> 
#> Absolute minimum support count: 188 
#> 
#> create itemset ... 
#> set transactions ...[4238 item(s), 18835 transaction(s)] done [0.10s].
#> sorting and recoding items ... [779 item(s)] done [0.01s].
#> creating sparse bit matrix ... [779 row(s), 18835 column(s)] done [0.00s].
#> writing  ... [1699 set(s)] done [0.61s].
#> Creating S4 object  ... done [0.00s].

cat(sprintf("Itemsets - Apriori: %d  |  Eclat: %d\n",
            length(fis_apriori), length(fis_eclat)))
#> Itemsets - Apriori: 1699  |  Eclat: 1699

stopifnot(length(fis_apriori) == length(fis_eclat))

inspect(head(sort(fis_apriori, by = "support"), 20))
#>      items                                support    count
#> [1]  {85123A}                             0.17233873 3246 
#> [2]  {REGENCY CAKESTAND 3 TIER}           0.10544200 1986 
#> [3]  {85099B}                             0.10353066 1950 
#> [4]  {PACK OF 72 RETRO SPOT CAKE CASES}   0.09827449 1851 
#> [5]  {STRAWBERRY CERAMIC TRINKET BOX}     0.08685957 1636 
#> [6]  {LUNCH BAG RED SPOTTY}               0.08170958 1539 
#> [7]  {ASSORTED COLOUR BIRD ORNAMENT}      0.07454208 1404 
#> [8]  {60 TEATIME FAIRY CAKE CASES}        0.07087868 1335 
#> [9]  {HOME BUILDING BLOCK WORD}           0.07045394 1327 
#> [10] {SET/20 RED SPOTTY PAPER NAPKINS}    0.06254314 1178 
#> [11] {JUMBO STORAGE BAG SUKI}             0.06249005 1177 
#> [12] {SET/5 RED SPOTTY LID GLASS BOWLS}   0.06233077 1174 
#> [13] {PACK OF 60 PINK PAISLEY CAKE CASES} 0.06126891 1154 
#> [14] {LUNCH BAG SUKI DESIGN}              0.06100345 1149 
#> [15] {RETRO SPOT TEA SET CERAMIC 11 PC}   0.06020706 1134 
#> [16] {82494L}                             0.05898593 1111 
#> [17] {LUNCH BAG BLACK SKULL.}             0.05898593 1111 
#> [18] {HEART OF WICKER LARGE}              0.05803026 1093 
#> [19] {LOVE BUILDING BLOCK WORD}           0.05771171 1087 
#> [20] {RED HANGING HEART T-LIGHT HOLDER}   0.05643748 1063

5.2 Association rules

For rules we use a lower support threshold (0.005) to capture less frequent cross-sell patterns, together with a minimum confidence of 0.2 and rules of at most four items (maxlen = 4). After mining we retain only rules with a lift of at least 1.5, which removes near-random co-occurrences and keeps only materially positive associations. Redundant rules are then removed so that the final rule set is not inflated by longer antecedents expressing the same signal.
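
The redundancy criterion can be illustrated on three hypothetical rules. The sketch below is a simplified base-R re-implementation of the idea (a rule is redundant when a more general rule with the same RHS has equal or higher confidence); it is not the arules code, and the item names and confidences are invented for illustration.

```r
# Three hypothetical rules sharing the same consequent.
lhs <- list("TROWEL", c("TROWEL", "GLOVES"), c("TROWEL", "BUCKET"))
rhs <- c("FORK", "FORK", "FORK")
confidence <- c(0.90, 0.88, 0.95)

# Rule i is redundant if some rule j has the same RHS, an LHS that is a
# proper subset of rule i's LHS, and a confidence at least as high.
is_redundant_toy <- vapply(seq_along(lhs), function(i) {
  any(vapply(seq_along(lhs), function(j) {
    j != i &&
      rhs[j] == rhs[i] &&
      length(lhs[[j]]) < length(lhs[[i]]) &&
      all(lhs[[j]] %in% lhs[[i]]) &&
      confidence[j] >= confidence[i]
  }, logical(1)))
}, logical(1))

print(data.frame(
  lhs = vapply(lhs, paste, character(1), collapse = ", "),
  rhs, confidence, redundant = is_redundant_toy
))
```

Adding GLOVES to the antecedent lowers confidence relative to the simpler TROWEL rule, so that rule is flagged; adding BUCKET raises confidence, so the longer rule carries extra information and is kept.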

rules_raw <- apriori(trans, parameter = list(
  supp = SUPP_RULES, conf = CONF_RULES, minlen = 2L, maxlen = MAXLEN_RULES
))
#> Apriori
#> 
#> Parameter specification:
#>  confidence minval smax arem  aval originalSupport maxtime support minlen
#>         0.2    0.1    1 none FALSE            TRUE       5   0.005      2
#>  maxlen target  ext
#>       4  rules TRUE
#> 
#> Algorithmic control:
#>  filter tree heap memopt load sort verbose
#>     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
#> 
#> Absolute minimum support count: 94 
#> 
#> set item appearances ...[0 item(s)] done [0.00s].
#> set transactions ...[4238 item(s), 18835 transaction(s)] done [0.10s].
#> sorting and recoding items ... [1434 item(s)] done [0.01s].
#> creating transaction tree ... done [0.01s].
#> checking subsets of size 1 2 3 4
#>  done [0.06s].
#> writing ... [15377 rule(s)] done [0.00s].
#> creating S4 object  ... done [0.01s].

rules <- rules_raw[quality(rules_raw)$lift >= LIFT_MIN]
rules <- rules[!is.redundant(rules)]

cat(sprintf("Rules raw: %d  |  After filtering: %d\n",
            length(rules_raw), length(rules)))
#> Rules raw: 15377  |  After filtering: 14957

saveRDS(fis_apriori, file.path("data", "fis_apriori.rds"))
saveRDS(fis_eclat,   file.path("data", "fis_eclat.rds"))
saveRDS(rules,       file.path("data", "rules_final.rds"))

5.3 Top rules by lift and confidence

cat("Top 20 by lift:\n")
#> Top 20 by lift:
inspect(head(sort(rules, by = "lift"), 20))
#>      lhs                                       rhs                                       support confidence    coverage      lift count
#> [1]  {KIDS RAIN MAC BLUE}                   => {KIDS RAIN MAC PINK}                  0.005521635  0.8524590 0.006477303 124.46562   104
#> [2]  {KIDS RAIN MAC PINK}                   => {KIDS RAIN MAC BLUE}                  0.005521635  0.8062016 0.006848951 124.46562   104
#> [3]  {CHRISTMAS TREE DECORATION WITH BELL,                                                                                             
#>       CHRISTMAS TREE HEART DECORATION}      => {CHRISTMAS TREE STAR DECORATION}      0.005149987  0.8660714 0.005946376 114.87645    97
#> [4]  {CHRISTMAS TREE HEART DECORATION,                                                                                                 
#>       CHRISTMAS TREE STAR DECORATION}       => {CHRISTMAS TREE DECORATION WITH BELL} 0.005149987  0.8508772 0.006052562 107.55887    97
#> [5]  {CAST IRON HOOK GARDEN FORK}           => {CAST IRON HOOK GARDEN TROWEL}        0.007273693  0.8838710 0.008229360 107.40458   137
#> [6]  {CAST IRON HOOK GARDEN TROWEL}         => {CAST IRON HOOK GARDEN FORK}          0.007273693  0.8838710 0.008229360 107.40458   137
#> [7]  {CHRISTMAS TREE DECORATION WITH BELL,                                                                                             
#>       CHRISTMAS TREE STAR DECORATION}       => {CHRISTMAS TREE HEART DECORATION}     0.005149987  0.8660714 0.005946376 101.95285    97
#> [8]  {CHRISTMAS TREE STAR DECORATION}       => {CHRISTMAS TREE DECORATION WITH BELL} 0.005946376  0.7887324 0.007539156  99.70319   112
#> [9]  {CHRISTMAS TREE DECORATION WITH BELL}  => {CHRISTMAS TREE STAR DECORATION}      0.005946376  0.7516779 0.007910804  99.70319   112
#> [10] {CHRISTMAS TREE STAR DECORATION}       => {CHRISTMAS TREE HEART DECORATION}     0.006052562  0.8028169 0.007539156  94.50660   114
#> [11] {CHRISTMAS TREE HEART DECORATION}      => {CHRISTMAS TREE STAR DECORATION}      0.006052562  0.7125000 0.008494823  94.50660   114
#> [12] {CHRISTMAS TREE HEART DECORATION}      => {CHRISTMAS TREE DECORATION WITH BELL} 0.005946376  0.7000000 0.008494823  88.48658   112
#> [13] {CHRISTMAS TREE DECORATION WITH BELL}  => {CHRISTMAS TREE HEART DECORATION}     0.005946376  0.7516779 0.007910804  88.48658   112
#> [14] {CHILDS GARDEN TROWEL BLUE,                                                                                                       
#>       CHILDS GARDEN FORK PINK,                                                                                                         
#>       CHILDRENS GARDEN GLOVES BLUE}         => {CHILDS GARDEN FORK BLUE}             0.005043801  0.9500000 0.005309265  86.86044    95
#> [15] {CHILDS GARDEN FORK PINK,                                                                                                         
#>       CHILDRENS GARDEN GLOVES BLUE}         => {CHILDS GARDEN FORK BLUE}             0.005362357  0.9439252 0.005680913  86.30501   101
#> [16] {CHILDS GARDEN TROWEL BLUE,                                                                                                       
#>       CHILDS GARDEN FORK PINK,                                                                                                         
#>       CHILDRENS GARDEN GLOVES PINK}         => {CHILDS GARDEN FORK BLUE}             0.005309265  0.9433962 0.005627821  86.25664   100
#> [17] {CHILDS GARDEN TROWEL BLUE,                                                                                                       
#>       CHILDS GARDEN TROWEL PINK,                                                                                                       
#>       CHILDS GARDEN FORK PINK}              => {CHILDS GARDEN FORK BLUE}             0.008282453  0.9397590 0.008813379  85.92408   156
#> [18] {CHILDS GARDEN TROWEL BLUE,                                                                                                       
#>       CHILDS GARDEN FORK PINK}              => {CHILDS GARDEN FORK BLUE}             0.008388638  0.9294118 0.009025750  84.97801   158
#> [19] {84559B}                               => {84559A}                              0.005096894  0.6486486 0.007857712  84.84234    96
#> [20] {84559A}                               => {84559B}                              0.005096894  0.6666667 0.007645341  84.84234    96

cat("\nTop 20 by confidence:\n")
#> 
#> Top 20 by confidence:
inspect(head(sort(rules, by = "confidence"), 20))
#>      lhs                                    rhs                                support confidence    coverage     lift count
#> [1]  {18098c}                            => {DOTCOM POSTAGE}               0.006902044  1.0000000 0.006902044 25.69577   130
#> [2]  {CHILDS GARDEN TROWEL BLUE,                                                                                            
#>       CHILDS GARDEN FORK PINK,                                                                                              
#>       CHILDRENS GARDEN GLOVES BLUE}      => {CHILDS GARDEN TROWEL PINK}    0.005309265  1.0000000 0.005309265 70.01859   100
#> [3]  {CHILDS GARDEN TROWEL BLUE,                                                                                            
#>       CHILDRENS GARDEN GLOVES BLUE,                                                                                         
#>       CHILDRENS GARDEN GLOVES PINK}      => {CHILDS GARDEN TROWEL PINK}    0.006264932  1.0000000 0.006264932 70.01859   118
#> [4]  {CHILDS GARDEN TROWEL BLUE,                                                                                            
#>       CHILDRENS GARDEN GLOVES PINK}      => {CHILDS GARDEN TROWEL PINK}    0.006742766  0.9921875 0.006795859 69.47157   127
#> [5]  {CHILDS GARDEN TROWEL BLUE,                                                                                            
#>       CHILDS GARDEN FORK BLUE,                                                                                              
#>       CHILDS GARDEN FORK PINK}           => {CHILDS GARDEN TROWEL PINK}    0.008282453  0.9873418 0.008388638 69.13228   156
#> [6]  {CHILDS GARDEN TROWEL PINK,                                                                                            
#>       CHILDS GARDEN FORK BLUE,                                                                                              
#>       CHILDS GARDEN FORK PINK}           => {CHILDS GARDEN TROWEL BLUE}    0.008282453  0.9873418 0.008388638 72.36024   156
#> [7]  {CHILDS GARDEN TROWEL PINK,                                                                                            
#>       CHILDS GARDEN FORK BLUE,                                                                                              
#>       CHILDRENS GARDEN GLOVES PINK}      => {CHILDS GARDEN TROWEL BLUE}    0.005468543  0.9809524 0.005574728 71.89198   103
#> [8]  {STRAWBERRY CERAMIC TRINKET BOX,                                                                                       
#>       ENVELOPE 50 ROMANTIC IMAGES}       => {DOTCOM POSTAGE}               0.005415450  0.9807692 0.005521635 25.20162   102
#> [9]  {PACK OF 72 RETRO SPOT CAKE CASES,                                                                                     
#>       CHARLOTTE BAG , PINK/WHITE SPOTS,                                                                                     
#>       ANTIQUE SILVER TEA GLASS ETCHED}   => {DOTCOM POSTAGE}               0.005415450  0.9807692 0.005521635 25.20162   102
#> [10] {PACK OF 72 RETRO SPOT CAKE CASES,                                                                                     
#>       SMALL GLASS HEART TRINKET POT,                                                                                        
#>       ANTIQUE SILVER TEA GLASS ETCHED}   => {DOTCOM POSTAGE}               0.005203079  0.9800000 0.005309265 25.18186    98
#> [11] {CHILDS GARDEN TROWEL BLUE,                                                                                            
#>       CHILDS GARDEN FORK PINK}           => {CHILDS GARDEN TROWEL PINK}    0.008813379  0.9764706 0.009025750 68.37109   166
#> [12] {POPPY'S PLAYHOUSE BEDROOM,                                                                                            
#>       POPPY'S PLAYHOUSE LIVINGROOM,                                                                                         
#>       POPPY'S PLAYHOUSE BATHROOM}        => {POPPY'S PLAYHOUSE KITCHEN}    0.008547916  0.9757576 0.008760287 53.73799   161
#> [13] {CHILDS GARDEN TROWEL PINK,                                                                                            
#>       CHILDS GARDEN FORK PINK,                                                                                              
#>       CHILDRENS GARDEN GLOVES BLUE}      => {CHILDS GARDEN TROWEL BLUE}    0.005309265  0.9708738 0.005468543 71.15334   100
#> [14] {CHILDS GARDEN FORK PINK,                                                                                              
#>       CHILDRENS GARDEN GLOVES BLUE,                                                                                         
#>       CHILDRENS GARDEN GLOVES PINK}      => {CHILDS GARDEN TROWEL PINK}    0.005309265  0.9708738 0.005468543 67.97921   100
#> [15] {CHILDS GARDEN TROWEL PINK,                                                                                            
#>       CHILDS GARDEN FORK PINK,                                                                                              
#>       CHILDRENS GARDEN GLOVES BLUE}      => {CHILDRENS GARDEN GLOVES PINK} 0.005309265  0.9708738 0.005468543 68.74589   100
#> [16] {CHILDS GARDEN TROWEL PINK,                                                                                            
#>       CHILDS GARDEN FORK BLUE,                                                                                              
#>       CHILDRENS GARDEN GLOVES BLUE}      => {CHILDS GARDEN TROWEL BLUE}    0.005256172  0.9705882 0.005415450 71.13241    99
#> [17] {CHILDS GARDEN TROWEL PINK,                                                                                            
#>       CHILDS GARDEN FORK BLUE}           => {CHILDS GARDEN TROWEL BLUE}    0.008707194  0.9704142 0.008972657 71.11966   164
#> [18] {RED SPOTTY CHARLOTTE BAG,                                                                                             
#>       STRAWBERRY CERAMIC TRINKET BOX,                                                                                       
#>       SMALL GLASS HEART TRINKET POT}     => {DOTCOM POSTAGE}               0.005203079  0.9702970 0.005362357 24.93253    98
#> [19] {CHILDS GARDEN TROWEL BLUE,                                                                                            
#>       CHILDS GARDEN FORK PINK,                                                                                              
#>       CHILDRENS GARDEN GLOVES BLUE}      => {CHILDRENS GARDEN GLOVES PINK} 0.005149987  0.9700000 0.005309265 68.68402    97
#> [20] {PACK OF 72 RETRO SPOT CAKE CASES,                                                                                     
#>       SWEETHEART CERAMIC TRINKET BOX,                                                                                       
#>       ANTIQUE SILVER TEA GLASS ETCHED}   => {DOTCOM POSTAGE}               0.005149987  0.9700000 0.005309265 24.92490    97

Top rules by lift expose tightly coupled product pairs (e.g. complementary tools, matched seasonal decorations). High-lift rules are niche in absolute frequency but exceptionally strong in relative terms and well-suited for targeted recommendations. High-confidence rules that also carry high lift are analytically reliable - the pattern is not driven purely by a popular RHS item.
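
The lift of the top rule can be reproduced from the other columns of the table, which also shows what a value above 100 means in basket counts. The sketch below only re-uses figures printed above; the support of KIDS RAIN MAC PINK is read off as the coverage of the reverse rule.

```r
# Values copied from the first two rows of the top-lift table above.
conf_blue_to_pink <- 0.8524590    # {KIDS RAIN MAC BLUE} => {KIDS RAIN MAC PINK}
supp_blue         <- 0.006477303  # coverage of rule [1] = support of BLUE
supp_pink         <- 0.006848951  # coverage of rule [2] = support of PINK
n_baskets         <- 18835
observed_count    <- 104          # baskets containing both items

# Lift = confidence / support of the consequent.
lift_check <- conf_blue_to_pink / supp_pink

# Baskets expected to contain both items if they were independent.
expected_count <- n_baskets * supp_blue * supp_pink

cat(sprintf("lift = %.2f | expected baskets = %.2f | observed = %d\n",
            lift_check, expected_count, observed_count))
```

Under independence fewer than one basket would be expected to contain both items, against 104 observed; the ratio of the two is the reported lift.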

6 Curation - reporting-oriented views

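At this stage the rules are already filtered for minimum lift and non-redundancy. To make the output presentation-friendly we focus on rules with a single-item RHS, which are the easiest to communicate and deploy. Because apriori() in arules only generates single-item consequents, this filter acts as a safeguard here and keeps all 14957 rules. We then produce three views: highest-lift bundles, highest-confidence predictors, and cross-sell candidates with meaningful commercial scale (support of at least 0.01 and lift of at least 2).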

rules_rhs1 <- rules[size(rhs(rules)) == 1]
q          <- quality(rules_rhs1)

rules_bundle <- head(sort(rules_rhs1, by = "lift"),       30)
rules_conf   <- head(sort(rules_rhs1, by = "confidence"), 30)
rules_xsell  <- rules_rhs1[q$support >= 0.01 & q$lift >= 2] %>%
  sort(by = "lift") %>%
  head(30)

write.csv(as(rules_bundle, "data.frame"), file.path("data", "rules_top_lift.csv"),       row.names = FALSE)
write.csv(as(rules_conf,   "data.frame"), file.path("data", "rules_top_confidence.csv"), row.names = FALSE)
write.csv(as(rules_xsell,  "data.frame"), file.path("data", "rules_cross_sell.csv"),     row.names = FALSE)

cat(sprintf("Single-RHS rules: %d\n", length(rules_rhs1)))
#> Single-RHS rules: 14957
cat("Top 10 by lift:\n");       inspect(head(rules_bundle, 10))
#> Top 10 by lift:
#>      lhs                                       rhs                                       support confidence    coverage      lift count
#> [1]  {KIDS RAIN MAC BLUE}                   => {KIDS RAIN MAC PINK}                  0.005521635  0.8524590 0.006477303 124.46562   104
#> [2]  {KIDS RAIN MAC PINK}                   => {KIDS RAIN MAC BLUE}                  0.005521635  0.8062016 0.006848951 124.46562   104
#> [3]  {CHRISTMAS TREE DECORATION WITH BELL,                                                                                             
#>       CHRISTMAS TREE HEART DECORATION}      => {CHRISTMAS TREE STAR DECORATION}      0.005149987  0.8660714 0.005946376 114.87645    97
#> [4]  {CHRISTMAS TREE HEART DECORATION,                                                                                                 
#>       CHRISTMAS TREE STAR DECORATION}       => {CHRISTMAS TREE DECORATION WITH BELL} 0.005149987  0.8508772 0.006052562 107.55887    97
#> [5]  {CAST IRON HOOK GARDEN FORK}           => {CAST IRON HOOK GARDEN TROWEL}        0.007273693  0.8838710 0.008229360 107.40458   137
#> [6]  {CAST IRON HOOK GARDEN TROWEL}         => {CAST IRON HOOK GARDEN FORK}          0.007273693  0.8838710 0.008229360 107.40458   137
#> [7]  {CHRISTMAS TREE DECORATION WITH BELL,                                                                                             
#>       CHRISTMAS TREE STAR DECORATION}       => {CHRISTMAS TREE HEART DECORATION}     0.005149987  0.8660714 0.005946376 101.95285    97
#> [8]  {CHRISTMAS TREE STAR DECORATION}       => {CHRISTMAS TREE DECORATION WITH BELL} 0.005946376  0.7887324 0.007539156  99.70319   112
#> [9]  {CHRISTMAS TREE DECORATION WITH BELL}  => {CHRISTMAS TREE STAR DECORATION}      0.005946376  0.7516779 0.007910804  99.70319   112
#> [10] {CHRISTMAS TREE STAR DECORATION}       => {CHRISTMAS TREE HEART DECORATION}     0.006052562  0.8028169 0.007539156  94.50660   114
cat("\nTop 10 by confidence:\n"); inspect(head(rules_conf,   10))
#> 
#> Top 10 by confidence:
#>      lhs                                    rhs                             support confidence    coverage     lift count
#> [1]  {18098c}                            => {DOTCOM POSTAGE}            0.006902044  1.0000000 0.006902044 25.69577   130
#> [2]  {CHILDS GARDEN TROWEL BLUE,                                                                                         
#>       CHILDS GARDEN FORK PINK,                                                                                           
#>       CHILDRENS GARDEN GLOVES BLUE}      => {CHILDS GARDEN TROWEL PINK} 0.005309265  1.0000000 0.005309265 70.01859   100
#> [3]  {CHILDS GARDEN TROWEL BLUE,                                                                                         
#>       CHILDRENS GARDEN GLOVES BLUE,                                                                                      
#>       CHILDRENS GARDEN GLOVES PINK}      => {CHILDS GARDEN TROWEL PINK} 0.006264932  1.0000000 0.006264932 70.01859   118
#> [4]  {CHILDS GARDEN TROWEL BLUE,                                                                                         
#>       CHILDRENS GARDEN GLOVES PINK}      => {CHILDS GARDEN TROWEL PINK} 0.006742766  0.9921875 0.006795859 69.47157   127
#> [5]  {CHILDS GARDEN TROWEL BLUE,                                                                                         
#>       CHILDS GARDEN FORK BLUE,                                                                                           
#>       CHILDS GARDEN FORK PINK}           => {CHILDS GARDEN TROWEL PINK} 0.008282453  0.9873418 0.008388638 69.13228   156
#> [6]  {CHILDS GARDEN TROWEL PINK,                                                                                         
#>       CHILDS GARDEN FORK BLUE,                                                                                           
#>       CHILDS GARDEN FORK PINK}           => {CHILDS GARDEN TROWEL BLUE} 0.008282453  0.9873418 0.008388638 72.36024   156
#> [7]  {CHILDS GARDEN TROWEL PINK,                                                                                         
#>       CHILDS GARDEN FORK BLUE,                                                                                           
#>       CHILDRENS GARDEN GLOVES PINK}      => {CHILDS GARDEN TROWEL BLUE} 0.005468543  0.9809524 0.005574728 71.89198   103
#> [8]  {STRAWBERRY CERAMIC TRINKET BOX,                                                                                    
#>       ENVELOPE 50 ROMANTIC IMAGES}       => {DOTCOM POSTAGE}            0.005415450  0.9807692 0.005521635 25.20162   102
#> [9]  {PACK OF 72 RETRO SPOT CAKE CASES,                                                                                  
#>       CHARLOTTE BAG , PINK/WHITE SPOTS,                                                                                  
#>       ANTIQUE SILVER TEA GLASS ETCHED}   => {DOTCOM POSTAGE}            0.005415450  0.9807692 0.005521635 25.20162   102
#> [10] {PACK OF 72 RETRO SPOT CAKE CASES,                                                                                  
#>       SMALL GLASS HEART TRINKET POT,                                                                                     
#>       ANTIQUE SILVER TEA GLASS ETCHED}   => {DOTCOM POSTAGE}            0.005203079  0.9800000 0.005309265 25.18186    98
cat("\nTop 10 cross-sell (supp ≥ 0.01, lift ≥ 2):\n"); inspect(head(rules_xsell, 10))
#> 
#> Top 10 cross-sell (supp ≥ 0.01, lift ≥ 2):
#>      lhs                                rhs                               support confidence   coverage     lift count
#> [1]  {CHILDS GARDEN TROWEL BLUE}     => {CHILDS GARDEN FORK BLUE}      0.01008760  0.7392996 0.01364481 67.59567   190
#> [2]  {CHILDS GARDEN FORK BLUE}       => {CHILDS GARDEN TROWEL BLUE}    0.01008760  0.9223301 0.01093709 67.59567   190
#> [3]  {CHILDS GARDEN TROWEL PINK}     => {CHILDS GARDEN FORK PINK}      0.01083090  0.7583643 0.01428192 65.22279   204
#> [4]  {CHILDS GARDEN FORK PINK}       => {CHILDS GARDEN TROWEL PINK}    0.01083090  0.9315068 0.01162729 65.22279   204
#> [5]  {CHILDS GARDEN TROWEL BLUE}     => {CHILDS GARDEN TROWEL PINK}    0.01130873  0.8287938 0.01364481 58.03097   213
#> [6]  {CHILDS GARDEN TROWEL PINK}     => {CHILDS GARDEN TROWEL BLUE}    0.01130873  0.7918216 0.01428192 58.03097   213
#> [7]  {POPPY'S PLAYHOUSE BEDROOM,                                                                                      
#>       POPPY'S PLAYHOUSE KITCHEN}     => {POPPY'S PLAYHOUSE LIVINGROOM} 0.01093709  0.7518248 0.01454739 54.25525   206
#> [8]  {POPPY'S PLAYHOUSE LIVINGROOM,                                                                                   
#>       POPPY'S PLAYHOUSE KITCHEN}     => {POPPY'S PLAYHOUSE BEDROOM}    0.01093709  0.8803419 0.01242368 52.14226   206
#> [9]  {POPPY'S PLAYHOUSE LIVINGROOM}  => {POPPY'S PLAYHOUSE BEDROOM}    0.01189275  0.8582375 0.01385718 50.83303   224
#> [10] {POPPY'S PLAYHOUSE BEDROOM}     => {POPPY'S PLAYHOUSE LIVINGROOM} 0.01189275  0.7044025 0.01688346 50.83303   224

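How to read a rule row:
support - fraction of all baskets that contain both LHS and RHS;
confidence - share of baskets containing LHS that also contain RHS (support divided by coverage);
coverage - fraction of all baskets that contain LHS;
lift - confidence divided by the support of RHS; values above 1 mean the items co-occur more often than independence would predict;
count - absolute number of baskets containing both LHS and RHS.

The cross-sell support threshold (≥ 1%) ensures that a pattern occurs at non-trivial scale.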
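
As a sanity check, the measures in a rule row can be recomputed from one another. The base R sketch below uses values copied from cross-sell rows [1] and [2] above. The variable names are illustrative only, and the coverage of row [2] stands in for the support of {CHILDS GARDEN FORK BLUE}, since that rule has the fork as its only LHS item.

```r
# Values copied from cross-sell rows [1] and [2]; variable names are illustrative.
supp_ab <- 0.01008760  # support of {TROWEL BLUE, FORK BLUE} (both rows)
supp_a  <- 0.01364481  # coverage of row [1] = support of {TROWEL BLUE}
supp_b  <- 0.01093709  # coverage of row [2] = support of {FORK BLUE}
count   <- 190         # baskets containing both items

# confidence(A => B) = support(A and B) / support(A)
confidence <- supp_ab / supp_a

# lift(A => B) = confidence(A => B) / support(B)
lift <- confidence / supp_b

# count / support recovers the number of baskets used for mining
n_baskets <- round(count / supp_ab)

print(c(confidence = confidence, lift = lift, n_baskets = n_baskets))
```

Up to rounding of the printed inputs, the recomputed confidence and lift should agree with the values shown in row [1]. Because lift is symmetric in LHS and RHS, rows [1] and [2] report the same lift even though their confidences differ.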

7 Actionable shortlist

For a compact, deployment-ready rule set we restrict attention to short antecedents (≤ 2 items), keep a single-item RHS, remove operational artifacts (shipping / fees), and enforce joint support / confidence / lift thresholds. The objective is to avoid both ultra-rare niche bundles and rules whose confidence is high only because the consequent is a globally popular item.

rules_act <- rules[
  size(lhs(rules)) <= 2 &
  size(rhs(rules)) == 1
]
rules_act <- rules_act[!(lhs(rules_act) %pin% DROP_TERMS)]
rules_act <- rules_act[!(rhs(rules_act) %pin% DROP_TERMS)]

q_act     <- quality(rules_act)
rules_act <- rules_act[
  q_act$support    >= SUPP_ACT &
  q_act$confidence >= CONF_ACT &
  q_act$lift       >= LIFT_LO  &
  q_act$lift       <= LIFT_HI
]

rules_top50 <- head(sort(rules_act, by = "lift"), 50)

cat(sprintf("Actionable candidate rules: %d\n", length(rules_act)))
#> Actionable candidate rules: 280

write.csv(as(rules_top50, "data.frame"), file.path("data", "rules_actionable_top50.csv"), row.names = FALSE)
saveRDS(rules_top50, file.path("data", "rules_actionable_top50.rds"))
inspect(head(rules_top50, 20))
#>      lhs                                    rhs                                    support confidence   coverage     lift count
#> [1]  {WOODLAND CHARLOTTE BAG,                                                                                                  
#>       RED SPOTTY CHARLOTTE BAG}          => {CHARLOTTE BAG , PINK/WHITE SPOTS}  0.01555615  0.6720183 0.02314839 14.64984   293
#> [2]  {RED SPOTTY CHARLOTTE BAG,                                                                                                
#>       CHARLOTTE BAG , PINK/WHITE SPOTS}  => {STRAWBERRY CHARLOTTE BAG}          0.01598089  0.5271454 0.03031590 14.64422   301
#> [3]  {PINK BLUE FELT CRAFT TRINKET BOX}  => {PINK CREAM FELT CRAFT TRINKET BOX} 0.02129015  0.6159754 0.03456331 14.28805   401
#> [4]  {PINK CREAM FELT CRAFT TRINKET BOX} => {PINK BLUE FELT CRAFT TRINKET BOX}  0.02129015  0.4938424 0.04311123 14.28805   401
#> [5]  {WOODLAND CHARLOTTE BAG,                                                                                                  
#>       CHARLOTTE BAG , PINK/WHITE SPOTS}  => {RED SPOTTY CHARLOTTE BAG}          0.01555615  0.7751323 0.02006902 14.11955   293
#> [6]  {PLASTERS IN TIN SPACEBOY}          => {PLASTERS IN TIN WOODLAND ANIMALS}  0.01582161  0.4467766 0.03541280 13.95529   298
#> [7]  {PLASTERS IN TIN WOODLAND ANIMALS}  => {PLASTERS IN TIN SPACEBOY}          0.01582161  0.4941957 0.03201487 13.95529   298
#> [8]  {CHARLOTTE BAG SUKI DESIGN,                                                                                               
#>       CHARLOTTE BAG , PINK/WHITE SPOTS}  => {RED SPOTTY CHARLOTTE BAG}          0.01736130  0.7500000 0.02314839 13.66175   327
#> [9]  {PLASTERS IN TIN SPACEBOY}          => {PLASTERS IN TIN CIRCUS PARADE}     0.01640563  0.4632684 0.03541280 13.38291   309
#> [10] {PLASTERS IN TIN CIRCUS PARADE}     => {PLASTERS IN TIN SPACEBOY}          0.01640563  0.4739264 0.03461641 13.38291   309
#> [11] {84970L}                            => {84970S}                            0.02139634  0.6287051 0.03403239 13.08471   403
#> [12] {84970S}                            => {84970L}                            0.02139634  0.4453039 0.04804885 13.08471   403
#> [13] {LARGE POPCORN HOLDER}              => {SMALL POPCORN HOLDER}              0.02139634  0.6773109 0.03159012 13.05747   403
#> [14] {SMALL POPCORN HOLDER}              => {LARGE POPCORN HOLDER}              0.02139634  0.4124872 0.05187152 13.05747   403
#> [15] {RETRO SPOT LARGE MILK JUG}         => {RED RETROSPOT SMALL MILK JUG}      0.01730820  0.5182830 0.03339527 12.69423   326
#> [16] {RED RETROSPOT SMALL MILK JUG}      => {RETRO SPOT LARGE MILK JUG}         0.01730820  0.4239272 0.04082825 12.69423   326
#> [17] {RED SPOTTY CHARLOTTE BAG,                                                                                                
#>       CHARLOTTE BAG , PINK/WHITE SPOTS}  => {WOODLAND CHARLOTTE BAG}            0.01555615  0.5131349 0.03031590 12.65039   293
#> [18] {TOY TIDY PINK RETROSPOT}           => {RECYCLING BAG RETROSPOT}           0.01927263  0.5671875 0.03397929 12.64258   363
#> [19] {RECYCLING BAG RETROSPOT}           => {TOY TIDY PINK RETROSPOT}           0.01927263  0.4295858 0.04486329 12.64258   363
#> [20] {RED SPOTTY CHARLOTTE BAG,                                                                                                
#>       CHARLOTTE BAG , PINK/WHITE SPOTS}  => {CHARLOTTE BAG SUKI DESIGN}         0.01736130  0.5726795 0.03031590 12.16056   327

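The shortlist balances non-trivial support with high relative strength (lift bounded below by 1.5 and above by 15). The upper lift cap removes the most extreme variant-completion patterns, such as the CHILDS GARDEN and POPPY'S PLAYHOUSE rules in the cross-sell table above (lift over 50). These tightly matched sets are statistically strong but often weak from a recommendation-design perspective. Milder variant pairs, such as the popcorn holders and Charlotte bags, stay below the cap and remain in the shortlist.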
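
The effect of the lift band can be illustrated on a handful of rows. The sketch below is base R on a hand-built data frame, not the arules object used in the report. The lift values are copied from the tables above, the bounds 1.5 and 15 are the ones quoted in the text, and the names `rules_df`, `lift_lo` and `lift_hi` are hypothetical stand-ins.

```r
# Four rules with lift values copied from the printed tables (illustrative subset).
rules_df <- data.frame(
  rule = c(
    "CHILDS GARDEN TROWEL BLUE => CHILDS GARDEN FORK BLUE",
    "POPPY'S PLAYHOUSE LIVINGROOM => POPPY'S PLAYHOUSE BEDROOM",
    "LARGE POPCORN HOLDER => SMALL POPCORN HOLDER",
    "TOY TIDY PINK RETROSPOT => RECYCLING BAG RETROSPOT"
  ),
  lift = c(67.59567, 50.83303, 13.05747, 12.64258),
  stringsAsFactors = FALSE
)

# Lift band quoted in the text; lower-case names are stand-ins for LIFT_LO / LIFT_HI.
lift_lo <- 1.5
lift_hi <- 15

# Keep rules inside the band: above near-random co-occurrence,
# below extreme variant-completion patterns.
keep <- rules_df$lift >= lift_lo & rules_df$lift <= lift_hi
print(rules_df[keep, ])
```

Only the two rules with lift between 1.5 and 15 pass. The garden-tool and playhouse rules are dropped by the upper bound alone, before any support or confidence threshold is considered.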

8 Visualization

8.1 Support vs Confidence - global rule space

plot(rules_top50,
     measure = c("support", "confidence"),
     shading = "lift")

Most rules cluster at low support values, with the highest-lift rules appearing as isolated points. This is consistent with the empirical structure of retail baskets: strong associations tend to be localized, and the lift shading highlights rules that are genuinely stronger than baseline item popularity would predict.

8.2 Network graph - top 20 rules by lift

set.seed(42)
rules_net <- head(sort(rules_top50, by = "lift"), 20)
plot(rules_net, method = "graph", engine = "htmlwidget")

The graph reveals product clusters (e.g. tightly coupled variants and seasonal bundles) and highlights hub items that frequently appear as RHS. Such hubs are natural anchor products for recommendation placement and bundle pricing strategies. A fixed seed is used so that the force-directed layout remains stable across renders, which makes side-by-side comparison and review easier.
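
One way to make the notion of a hub concrete is to count how often each item occurs as a consequent. The base R sketch below tallies the RHS labels of the 20 shortlist rules printed in section 7, typed in by hand for illustration. On the live object, a similar tally could presumably be built from `labels(rhs(rules_net))`; that route is an assumption and is not run here.

```r
# RHS labels of the 20 shortlist rules, copied by hand from the printed output.
rhs_labels <- c(
  "CHARLOTTE BAG , PINK/WHITE SPOTS", "STRAWBERRY CHARLOTTE BAG",
  "PINK CREAM FELT CRAFT TRINKET BOX", "PINK BLUE FELT CRAFT TRINKET BOX",
  "RED SPOTTY CHARLOTTE BAG", "PLASTERS IN TIN WOODLAND ANIMALS",
  "PLASTERS IN TIN SPACEBOY", "RED SPOTTY CHARLOTTE BAG",
  "PLASTERS IN TIN CIRCUS PARADE", "PLASTERS IN TIN SPACEBOY",
  "84970S", "84970L",
  "SMALL POPCORN HOLDER", "LARGE POPCORN HOLDER",
  "RED RETROSPOT SMALL MILK JUG", "RETRO SPOT LARGE MILK JUG",
  "WOODLAND CHARLOTTE BAG", "RECYCLING BAG RETROSPOT",
  "TOY TIDY PINK RETROSPOT", "CHARLOTTE BAG SUKI DESIGN"
)

# Count occurrences per item and sort so the most frequent consequents come first.
rhs_counts <- sort(table(rhs_labels), decreasing = TRUE)

# Items that are the consequent of more than one rule are hub candidates.
hubs <- names(rhs_counts[rhs_counts > 1])
print(hubs)
```

In this 20-rule sample only a couple of items recur as RHS, so the hub structure is modest. The same tally over a larger rule set would give a firmer basis for choosing anchor products.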

9 Summary

Purchasing behaviour in the Online Retail II dataset is strongly structured rather than random. By using StockCode as the underlying item identifier, excluding one-item baskets, profiling temporal concentration, and applying stricter lift-based filtering, the analysis is less exposed to noise from description variants, uninformative single-item baskets, and near-random co-occurrences. The resulting rule sets are therefore more interpretable, more robust, and closer to what could plausibly support recommendation, bundling, or merchandising decisions.

10 Reproducibility Appendix

required_pkgs <- c("readxl", "dplyr", "stringr", "lubridate", "arules", "arulesViz")

pkg_versions <- tibble::tibble(
  package = c("R", required_pkgs),
  version = c(
    as.character(getRversion()),
    vapply(required_pkgs, function(p) as.character(packageVersion(p)), character(1))
  )
)

knitr::kable(pkg_versions, caption = "R and package versions used in this run")

R and package versions used in this run

package    version
R          4.5.2
readxl     1.4.5
dplyr      1.1.4
stringr    1.6.0
lubridate  1.9.5
arules     1.7.13
arulesViz  1.5.4

sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#> 
#> Matrix products: default
#>   LAPACK version 3.12.1
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: Europe/Warsaw
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] arulesViz_1.5.4 arules_1.7.13   Matrix_1.7-4    lubridate_1.9.5
#> [5] stringr_1.6.0   dplyr_1.1.4     readxl_1.4.5   
#> 
#> loaded via a namespace (and not attached):
#>  [1] viridis_0.6.5      utf8_1.2.6         sass_0.4.10        generics_0.1.4    
#>  [5] tidyr_1.3.2        stringi_1.8.7      lattice_0.22-7     digest_0.6.39     
#>  [9] magrittr_2.0.4     evaluate_1.0.5     grid_4.5.2         timechange_0.4.0  
#> [13] RColorBrewer_1.1-3 fastmap_1.2.0      cellranger_1.1.0   jsonlite_2.0.0    
#> [17] ggrepel_0.9.7      gridExtra_2.3      purrr_1.2.1        viridisLite_0.4.3 
#> [21] scales_1.4.0       tweenr_2.0.3       jquerylib_0.1.4    cli_3.6.5         
#> [25] graphlayouts_1.2.3 rlang_1.1.6        polyclip_1.10-7    visNetwork_2.1.4  
#> [29] tidygraph_1.3.1    withr_3.0.2        cachem_1.1.0       yaml_2.3.12       
#> [33] otel_0.2.0         tools_4.5.2        memoise_2.0.1      ggplot2_4.0.2     
#> [37] vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.5    htmlwidgets_1.6.4 
#> [41] MASS_7.3-65        ggraph_2.2.2       pkgconfig_2.0.3    pillar_1.11.1     
#> [45] bslib_0.10.0       gtable_0.3.6       Rcpp_1.1.0         glue_1.8.0        
#> [49] ggforce_0.5.0      xfun_0.56          tibble_3.3.0       tidyselect_1.2.1  
#> [53] rstudioapi_0.18.0  knitr_1.51         farver_2.1.2       htmltools_0.5.9   
#> [57] igraph_2.2.2       labeling_0.4.3     rmarkdown_2.30     compiler_4.5.2    
#> [61] S7_0.2.1