online_retail

0.1 0.1 Association Rules on Online Retail II
0.2 1) Load and clean the invoice-line table
0.3 2) Quick profiling and sanity checks
0.4 3) Build market baskets and transactions
0.5 4) Mine frequent itemsets and association rules
0.6 5) Curation: reporting-oriented rule views
0.7 6) Actionable shortlist
0.8 7) Visualization
0.9 Summary

0.1 0.1 Association Rules on Online Retail II

This report presents a complete, end-to-end implementation of association rules mining applied to the Online Retail II dataset. The original data are provided at the invoice-line level, where each observation corresponds to a single product item recorded on a specific invoice. Such a structure is not directly suitable for market-basket analysis, which requires transactional data.

Therefore, the raw invoice-line table is transformed into a transactional representation, where each invoice corresponds to a single basket and products are encoded as binary indicators of presence or absence. This transformation enables the identification of co-occurrence patterns between items purchased together within the same transaction.

The analysis focuses on discovering frequent itemsets and association rules that capture systematic purchasing relationships. To evaluate the strength and relevance of the extracted rules, we rely on the standard metrics used in association rule mining:

Support, defined as the proportion of all transactions that contain both the left-hand side (LHS) and the right-hand side (RHS) of a rule. Support reflects how frequently a given combination of items occurs in the dataset.
Confidence, defined as \(P(\text{RHS} \mid \text{LHS})\), which measures the conditional probability of observing the RHS given that the LHS is present in a transaction. Confidence quantifies the predictive strength of the rule.
Lift, defined as the ratio of confidence to the marginal probability of the RHS, \(\text{Confidence} / P(\text{RHS})\). Lift compares the observed co-occurrence to what would be expected under statistical independence, with values greater than one indicating a positive association beyond baseline frequency.

Together, these measures allow for a balanced assessment of rule relevance, combining frequency, predictive accuracy, and deviation from random co-occurrence.
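Written out for a rule \(X \Rightarrow Y\) over \(N\) transactions, the three measures take the form below. The symbols \(X\), \(Y\), \(N\), and \(n(\cdot)\) (the number of transactions containing an itemset) are notation introduced here for illustration only:

\[
\text{Support}(X \Rightarrow Y) = \frac{n(X \cup Y)}{N}, \qquad
\text{Confidence}(X \Rightarrow Y) = \frac{n(X \cup Y)}{n(X)}, \qquad
\text{Lift}(X \Rightarrow Y) = \frac{\text{Confidence}(X \Rightarrow Y)}{n(Y)/N} = \frac{N \, n(X \cup Y)}{n(X)\, n(Y)}.
\]

The last expression shows that lift is symmetric in \(X\) and \(Y\), whereas confidence depends on the direction of the rule.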

0.1.1 0.1.1 Dataset Description

The analysis is based on the Online Retail II dataset, publicly available from the UCI Machine Learning Repository. The dataset contains transactional data from a UK-based online retailer selling unique all-occasion giftware. It covers all invoice-line transactions recorded between 01/12/2009 and 09/12/2011, providing a large-scale and well-documented example of real-world retail purchase behavior.

Each observation corresponds to a single product line within an invoice and includes the invoice identifier, stock code, product description, quantity, invoice date, unit price, customer identifier, and the customer's country. The dataset has been widely used in the literature and in applied research on market-basket analysis, recommendation systems, and consumer behavior, which supports its suitability and external validity for association rule mining.

The data were obtained from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/dataset/352/online+retail

cat("R Markdown pipeline is running.\n")
#> R Markdown pipeline is running.

0.2 1) Load and clean the invoice-line table

We keep only completed sales lines:

drop missing invoice / item description,
keep positive quantities,
remove credit notes / returns (in this dataset they typically have invoice numbers starting with “C”),
standardize product names for stable item identity.

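# Packages used throughout the report.
library(arules)   # transactions, apriori(), eclat(), is.redundant(), inspect()
library(readxl)   # read_excel()
library(dplyr)    # %>%, filter(), mutate(), count(), summarise(), ...
library(stringr)  # str_detect(), str_squish(), str_to_upper()

# read_excel() loads only the first worksheet unless `sheet` is specified.
df <- read_excel("online_retail_II.xlsx")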

df_clean <- df %>%
  filter(
    !is.na(Invoice),
    !is.na(Description),
    Quantity > 0,
    !str_detect(as.character(Invoice), "^C")
  ) %>%
  mutate(
    Invoice       = as.character(Invoice),
    StockCode     = as.character(StockCode),
    Description   = str_squish(str_to_upper(Description)),
    `Customer ID` = as.integer(`Customer ID`),
    Country       = as.character(Country)
  ) %>%
  select(Invoice, StockCode, Description, Quantity, InvoiceDate, Price, `Customer ID`, Country)

saveRDS(df_clean, "online_retail_clean.rds")

0.3 2) Quick profiling and sanity checks

These checks confirm that cleaning worked and give intuition for reasonable parameter ranges later (e.g., minimum support and maximum rule length).

cat("Rows after cleaning:", nrow(df_clean), "\n")
#> Rows after cleaning: 512033
cat("Distinct invoices (transactions):", n_distinct(df_clean$Invoice), "\n")
#> Distinct invoices (transactions): 21002
cat("Distinct items (Description):", n_distinct(df_clean$Description), "\n")
#> Distinct items (Description): 4515

0.3.1 2.1 Data Scale and Structure After Cleaning

After cleaning, the data contain 512,033 sales lines, 21,002 distinct invoices (transactions), and 4,515 distinct product descriptions. This corresponds to an average of about 24 line items per invoice (512,033 / 21,002 ≈ 24.4), so baskets are large enough to support multi-item patterns.

The combination of a large item universe and a relatively limited number of transactions implies a highly sparse transactional space. This motivates conservative support thresholds and explicit limits on rule length in later stages to prevent combinatorial explosion and unstable patterns.
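As a rough illustration of the combinatorial growth mentioned above, the following sketch counts how many 2-item and 3-item itemsets are possible in principle for the 4,515 items reported after cleaning. The variable names are illustrative, and the figures are an upper bound on the search space, not the number of candidates an algorithm evaluates once support pruning is applied.

```r
# Item universe reported after cleaning (from the profiling output above).
n_items <- 4515

# Number of possible itemsets of size 2 and 3 (binomial coefficients).
n_pairs   <- choose(n_items, 2)
n_triples <- choose(n_items, 3)

cat("Possible pairs:  ", format(n_pairs, big.mark = ","), "\n")
cat("Possible triples:", format(n_triples, big.mark = ","), "\n")
```

With only 21,002 transactions available, the vast majority of these combinations cannot reach any reasonable support threshold, which is why pruning by minimum support is effective here.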

df_clean %>%
  count(Description, sort = TRUE) %>%
  slice_head(n = 20)
#> # A tibble: 20 × 2
#>    Description                            n
#>    <chr>                              <int>
#>  1 WHITE HANGING HEART T-LIGHT HOLDER  3456
#>  2 REGENCY CAKESTAND 3 TIER            2046
#>  3 STRAWBERRY CERAMIC TRINKET BOX      1714
#>  4 PACK OF 72 RETRO SPOT CAKE CASES    1456
#>  5 ASSORTED COLOUR BIRD ORNAMENT       1450
#>  6 60 TEATIME FAIRY CAKE CASES         1394
#>  7 HOME BUILDING BLOCK WORD            1376
#>  8 JUMBO BAG RED RETROSPOT             1280
#>  9 LUNCH BAG RED SPOTTY                1246
#> 10 REX CASH+CARRY JUMBO SHOPPER        1226
#> 11 JUMBO STORAGE BAG SUKI              1203
#> 12 PACK OF 60 PINK PAISLEY CAKE CASES  1191
#> 13 WOODEN FRAME ANTIQUE WHITE          1169
#> 14 LUNCH BAG BLACK SKULL.              1156
#> 15 LUNCH BAG SUKI DESIGN               1146
#> 16 HEART OF WICKER LARGE               1145
#> 17 LOVE BUILDING BLOCK WORD            1129
#> 18 RED HANGING HEART T-LIGHT HOLDER    1106
#> 19 JUMBO SHOPPER VINTAGE RED PAISLEY   1085
#> 20 JUMBO BAG STRAWBERRY                1078

0.3.2 2.2 Product Popularity and Demand Concentration

The top-20 frequency table reveals strong demand concentration. The most frequent products include:

WHITE HANGING HEART T-LIGHT HOLDER (3,456 occurrences),
REGENCY CAKESTAND 3 TIER (2,046),
STRAWBERRY CERAMIC TRINKET BOX (1,714),
PACK OF 72 RETRO SPOT CAKE CASES (1,456).

These items are globally popular and therefore tend to appear in many baskets. As a consequence, rules involving such items—especially on the RHS—may achieve high confidence even without strong conditional dependence. This reinforces the need to interpret confidence jointly with lift.

df_clean %>%
  count(Invoice, name = "n_lines") %>%
  summarise(
    mean_lines   = mean(n_lines),
    median_lines = median(n_lines),
    p90_lines    = quantile(n_lines, 0.90),
    p99_lines    = quantile(n_lines, 0.99)
  )
#> # A tibble: 1 × 4
#>   mean_lines median_lines p90_lines p99_lines
#>        <dbl>        <dbl>     <dbl>     <dbl>
#> 1       24.4           15        51       180

0.3.3 2.3 Invoice Complexity and Basket Size Distribution

The invoice-size summary shows a heavy right tail: the median invoice contains 15 lines, the mean is 24.4, the 90th percentile is 51, and the 99th percentile reaches 180. A small number of very large baskets can therefore generate a disproportionate number of co-occurrences and inflate higher-order patterns.

This empirical structure directly justifies the later constraints on rule length (maxlen) and the reliance on minimum support thresholds to stabilize the mining process.
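To illustrate why large invoices matter, note that a basket with n distinct items contains n(n-1)/2 item pairs. The sketch below applies this to the median, 90th-percentile, and 99th-percentile invoice sizes from the summary above. It treats line counts as if they were distinct items, which is an approximation because de-duplication happens in a later step.

```r
# Invoice sizes taken from the summary table above (lines per invoice).
sizes <- c(median = 15, p90 = 51, p99 = 180)

# Each basket of n items contributes choose(n, 2) co-occurring pairs.
pairs_per_basket <- sapply(sizes, function(n) choose(n, 2))
print(pairs_per_basket)

# Relative weight of a 99th-percentile basket versus a median basket.
ratio_p99_vs_median <- unname(pairs_per_basket["p99"] / pairs_per_basket["median"])
cat("A p99 basket yields about", round(ratio_p99_vs_median),
    "times as many pairs as a median basket.\n")
```

The quadratic growth in pairs (and faster growth for triples) is the reason a handful of very large invoices can dominate co-occurrence counts.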

0.4 3) Build market baskets and transactions

Association rules are typically mined on binary baskets: whether an item was present in the invoice at least once. We therefore de-duplicate (Invoice, Description) before building baskets.

df_clean <- readRDS("online_retail_clean.rds")

baskets <- df_clean %>%
  distinct(Invoice, Description) %>%
  group_by(Invoice) %>%
  summarise(items = list(Description), .groups = "drop")

trans <- as(baskets$items, "transactions")
saveRDS(trans, "online_retail_transactions.rds")

summary(trans)
#> transactions as itemMatrix in sparse format with
#>  21002 rows (elements/itemsets/transactions) and
#>  4515 columns (items) and a density of 0.005254849 
#> 
#> most frequent items:
#> WHITE HANGING HEART T-LIGHT HOLDER           REGENCY CAKESTAND 3 TIER 
#>                               3316                               2020 
#>     STRAWBERRY CERAMIC TRINKET BOX      ASSORTED COLOUR BIRD ORNAMENT 
#>                               1640                               1413 
#>   PACK OF 72 RETRO SPOT CAKE CASES                            (Other) 
#>                               1410                             488487 
#> 
#> element (itemset/transaction) length distribution:
#> sizes
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
#> 2168  877  724  683  647  618  592  551  615  541  570  561  556  525  546  517 
#>   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
#>  512  527  500  491  437  400  369  323  308  294  270  260  246  237  206  179 
#>   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
#>  173  162  151  153  144  130  140  153  138  112  107  102   88   86  115   80 
#>   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
#>   84   84   82   76   57   56   61   58   64   55   43   49   41   45   37   42 
#>   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
#>   40   28   22   28   28   32   34   22   24   27   19   17   15   15   21   15 
#>   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
#>   12   11   15   18    9    9   13   14   19   15    7    7    8    8   15    7 
#>   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
#>    9    8   10   12    7    4    8    5    8    6    9    9    8    7    6    4 
#>  113  114  115  116  117  118  119  120  121  122  123  125  126  127  128  129 
#>    8    8   14    7    8    5    5    3    9    5    5    5    4    6    3    3 
#>  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144  145 
#>    5    5    6    8    7    9    4    5    5    2    2    2    4    5    2    6 
#>  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160  161 
#>    4    4    6    3    2    8    1    3    3    4    4    2    5    3    1    4 
#>  162  163  164  165  166  167  168  169  170  171  173  174  175  176  177  178 
#>    7    2    4    7    3    3    3    4    1    4    2    3    2    3    2    2 
#>  180  181  183  184  185  186  187  188  189  190  191  192  193  194  195  196 
#>    3    5    3    3    1    4    1    1    3    2    2    3    3    2    2    2 
#>  197  199  200  201  202  203  204  205  206  207  211  212  213  216  217  218 
#>    2    2    4    4    3    1    2    1    2    2    1    1    1    3    2    4 
#>  219  220  221  222  223  224  227  228  229  231  233  237  238  240  241  243 
#>    2    2    2    2    2    2    1    1    1    1    1    1    1    2    2    1 
#>  244  246  247  249  253  254  255  261  263  264  265  267  268  269  272  274 
#>    1    1    3    1    1    2    1    1    1    2    1    1    1    1    2    1 
#>  275  276  279  284  295  296  298  299  307  314  320  323  325  330  335  337 
#>    2    1    1    2    1    1    1    1    1    2    1    1    1    1    1    1 
#>  340  341  343  355  362  368  369  372  376  379  384  400  406  411  416  420 
#>    2    1    2    2    1    2    1    1    1    1    1    2    1    1    1    1 
#>  424  427  429  435  437  438  439  441  446  448  459  460  463  465  466  476 
#>    1    1    2    1    1    1    1    1    1    1    1    1    1    2    1    1 
#>  479  480  483  484  497  498  501  506  512  515  522  536  545  546  556  567 
#>    1    2    1    1    2    1    1    1    1    1    1    1    1    1    1    1 
#>  576  577  578  585  588  589  595  601  647  673 
#>    1    1    1    1    1    1    1    1    1    1 
#> 
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    6.00   15.00   23.73   28.00  673.00 
#> 
#> includes extended item information - examples:
#>                     labels
#> 1    *BOOMBOX IPOD CLASSIC
#> 2 *USB OFFICE GLITTER LAMP
#> 3  *USB OFFICE MIRROR BALL

0.4.1 3.1 Transaction Matrix Properties

The transaction summary confirms 21,002 transactions and 4,515 items, with a matrix density of 0.00525, meaning that about 99.5% of entries are zeros. The median basket size after de-duplication remains 15 items, the mean falls slightly to 23.73 (from 24.4 lines per invoice before de-duplication), and the maximum reaches 673 items.

This extreme sparsity is typical for retail data and implies that most potential item combinations never occur. Consequently, very high-lift rules are expected to be rare but highly informative.
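The reported density can be cross-checked by hand from the figures in the transaction summary. The sketch below is plain arithmetic on numbers copied from that output; the variable names are illustrative.

```r
# Figures copied from the summary(trans) output above.
n_trans <- 21002
n_items <- 4515

# Top-5 item counts plus the "(Other)" bucket give the number of non-zero cells.
n_nonzero <- sum(c(3316, 2020, 1640, 1413, 1410, 488487))

density     <- n_nonzero / (n_trans * n_items)  # share of cells equal to 1
mean_basket <- n_nonzero / n_trans              # average distinct items per invoice

cat("Density:", signif(density, 4), "\n")
cat("Share of zeros:", round(100 * (1 - density), 2), "%\n")
cat("Mean basket size:", round(mean_basket, 2), "\n")
```

Both derived values agree with the density and mean basket size printed by summary(trans).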

itemFrequencyPlot(trans, topN = 20, type = "absolute")

Interpretation.
This plot shows the most frequent items by transaction count (true support counts). Very frequent items can dominate high-confidence rules (because many baskets contain them), which is why lift is important.
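A small numeric illustration of this point, using the most frequent item from the transaction summary: if the RHS were statistically independent of the LHS, the confidence of any rule pointing to it would simply equal its basket share. The sketch below also shows how little lift such a rule needs to reach the minimum confidence of 0.20 used later for rule mining. Variable names are illustrative.

```r
n_trans   <- 21002   # transactions, from summary(trans)
count_rhs <- 3316    # baskets containing WHITE HANGING HEART T-LIGHT HOLDER

# Under independence, confidence(LHS => RHS) equals P(RHS), whatever the LHS is.
baseline_conf <- count_rhs / n_trans

# Lift a rule with this RHS needs in order to reach a confidence of 0.20.
min_conf      <- 0.20
lift_required <- min_conf / baseline_conf

cat("Baseline confidence:", round(baseline_conf, 3), "\n")
cat("Lift needed for confidence 0.20:", round(lift_required, 2), "\n")
```

For a less common RHS item, the same confidence level would require a far higher lift, which is why confidence alone favors popular consequents.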

basket_sizes <- size(trans)
summary(basket_sizes)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    6.00   15.00   23.73   28.00  673.00

Interpretation.
This is the basket-size distribution after item de-duplication. The median basket holds 15 distinct items and a quarter of baskets hold 6 or fewer, so rules with a long LHS rest on few supporting baskets and are often unstable; short-LHS rules are usually the most actionable.

0.5 4) Mine frequent itemsets and association rules

0.5.1 4.1 Frequent itemsets (Apriori vs Eclat)

We mine frequent itemsets with:

supp = 0.01 (≥ 1% of transactions),
maxlen = 3 to keep results interpretable.

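Apriori and Eclat are both exact algorithms, so with the same thresholds they return the same frequent itemsets; the difference is the computational strategy (breadth-first candidate generation in Apriori, depth-first intersection of transaction-ID lists in Eclat).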

0.5.2 4.2 Rules (Apriori)

For rules we use:

lower support supp = 0.005 (to allow less frequent cross-sell patterns),
conf = 0.20 to reduce noisy rules,
maxlen = 4 and minlen = 2.

Then we keep only positive associations (lift > 1.2) and remove redundant rules.
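For reference, the redundancy criterion applied by is.redundant() in arules with its default settings can be written as follows (the notation \(X\), \(X'\), \(Y\) is introduced here for illustration). A rule \(X \Rightarrow Y\) is flagged as redundant when a more general rule with the same consequent is at least as confident:

\[
\exists\, X' \subset X : \quad \text{Confidence}(X' \Rightarrow Y) \;\ge\; \text{Confidence}(X \Rightarrow Y).
\]

In that case the extra items in \(X\) add no predictive value for \(Y\), so only the shorter rule is kept.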

trans <- readRDS("online_retail_transactions.rds")

fis_apriori <- apriori(
  trans,
  parameter = list(
    target = "frequent itemsets",
    supp   = 0.01,
    maxlen = 3
  )
)
#> Apriori
#> 
#> Parameter specification:
#>  confidence minval smax arem  aval originalSupport maxtime support minlen
#>          NA    0.1    1 none FALSE            TRUE       5    0.01      1
#>  maxlen            target  ext
#>       3 frequent itemsets TRUE
#> 
#> Algorithmic control:
#>  filter tree heap memopt load sort verbose
#>     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
#> 
#> Absolute minimum support count: 210 
#> 
#> set item appearances ...[0 item(s)] done [0.00s].
#> set transactions ...[4515 item(s), 21002 transaction(s)] done [0.14s].
#> sorting and recoding items ... [703 item(s)] done [0.01s].
#> creating transaction tree ... done [0.01s].
#> checking subsets of size 1 2 3
#>  done [0.01s].
#> sorting transactions ... done [0.00s].
#> writing ... [1100 set(s)] done [0.00s].
#> creating S4 object  ... done [0.00s].

fis_eclat <- eclat(
  trans,
  parameter = list(
    supp   = 0.01,
    maxlen = 3
  )
)
#> Eclat
#> 
#> parameter specification:
#>  tidLists support minlen maxlen            target  ext
#>     FALSE    0.01      1      3 frequent itemsets TRUE
#> 
#> algorithmic control:
#>  sparse sort verbose
#>       7   -2    TRUE
#> 
#> Absolute minimum support count: 210 
#> 
#> create itemset ... 
#> set transactions ...[4515 item(s), 21002 transaction(s)] done [0.11s].
#> sorting and recoding items ... [703 item(s)] done [0.01s].
#> creating sparse bit matrix ... [703 row(s), 21002 column(s)] done [0.00s].
#> writing  ... [1100 set(s)] done [0.46s].
#> Creating S4 object  ... done [0.00s].

rules_raw <- apriori(
  trans,
  parameter = list(
    supp   = 0.005,
    conf   = 0.20,
    minlen = 2,
    maxlen = 4
  )
)
#> Apriori
#> 
#> Parameter specification:
#>  confidence minval smax arem  aval originalSupport maxtime support minlen
#>         0.2    0.1    1 none FALSE            TRUE       5   0.005      2
#>  maxlen target  ext
#>       4  rules TRUE
#> 
#> Algorithmic control:
#>  filter tree heap memopt load sort verbose
#>     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
#> 
#> Absolute minimum support count: 105 
#> 
#> set item appearances ...[0 item(s)] done [0.00s].
#> set transactions ...[4515 item(s), 21002 transaction(s)] done [0.12s].
#> sorting and recoding items ... [1383 item(s)] done [0.01s].
#> creating transaction tree ... done [0.00s].
#> checking subsets of size 1 2 3 4
#>  done [0.05s].
#> writing ... [5365 rule(s)] done [0.00s].
#> creating S4 object  ... done [0.00s].

rules <- subset(rules_raw, subset = lift > 1.2)
rules <- rules[!is.redundant(rules)]

saveRDS(fis_apriori, "fis_apriori.rds")
saveRDS(fis_eclat,   "fis_eclat.rds")
saveRDS(rules,       "rules_final.rds")

cat("Frequent itemsets (Apriori):", length(fis_apriori), "\n")
#> Frequent itemsets (Apriori): 1100
cat("Frequent itemsets (Eclat):", length(fis_eclat), "\n")
#> Frequent itemsets (Eclat): 1100
cat("Rules (filtered & non-redundant):", length(rules), "\n")
#> Rules (filtered & non-redundant): 5299

0.5.3 4.3 Frequent Itemsets: Apriori vs Eclat Consistency

Using identical parameters (supp = 0.01, maxlen = 3), Apriori and Eclat both return 1,100 frequent itemsets. Because both algorithms enumerate exactly the itemsets that meet the support threshold, this agreement is expected; the matching counts are a sanity check on the pipeline and not independent evidence that the patterns are genuine.
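Equal counts alone do not show that the two result sets contain the same itemsets with the same support. A stricter comparison could look like the sketch below, which runs on a small set of hypothetical toy baskets (items A to D are invented for the example) so that it is self-contained; the same comparison could be applied to fis_apriori and fis_eclat.

```r
library(arules)

# Toy baskets (hypothetical items), only to demonstrate the comparison.
toy <- as(list(
  c("A", "B", "C"),
  c("A", "B"),
  c("A", "C"),
  c("B", "C"),
  c("A", "B", "C", "D")
), "transactions")

params <- list(supp = 0.3)

fis_a <- apriori(toy,
                 parameter = c(params, target = "frequent itemsets"),
                 control   = list(verbose = FALSE))
fis_e <- eclat(toy,
               parameter = params,
               control   = list(verbose = FALSE))

# Compare the itemsets themselves, not only their number.
same_sets <- setequal(labels(fis_a), labels(fis_e))

# Compare support values after aligning both results by itemset label.
supp_a <- setNames(quality(fis_a)$support, labels(fis_a))
supp_e <- setNames(quality(fis_e)$support, labels(fis_e))
same_support <- isTRUE(all.equal(supp_a[names(supp_e)], supp_e))

cat("Same itemsets:", same_sets, "| same support:", same_support, "\n")
```

Aligning by label before comparing support avoids false mismatches caused by the two algorithms returning itemsets in a different order.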

inspect(head(sort(fis_apriori, by = "support"), 20))
#>      items                                support    count
#> [1]  {WHITE HANGING HEART T-LIGHT HOLDER} 0.15788972 3316 
#> [2]  {REGENCY CAKESTAND 3 TIER}           0.09618132 2020 
#> [3]  {STRAWBERRY CERAMIC TRINKET BOX}     0.07808780 1640 
#> [4]  {ASSORTED COLOUR BIRD ORNAMENT}      0.06727931 1413 
#> [5]  {PACK OF 72 RETRO SPOT CAKE CASES}   0.06713646 1410 
#> [6]  {60 TEATIME FAIRY CAKE CASES}        0.06361299 1336 
#> [7]  {HOME BUILDING BLOCK WORD}           0.06337492 1331 
#> [8]  {JUMBO BAG RED RETROSPOT}            0.05937530 1247 
#> [9]  {LUNCH BAG RED SPOTTY}               0.05799448 1218 
#> [10] {JUMBO STORAGE BAG SUKI}             0.05618513 1180 
#> [11] {PACK OF 60 PINK PAISLEY CAKE CASES} 0.05499476 1155 
#> [12] {WOODEN FRAME ANTIQUE WHITE}         0.05428054 1140 
#> [13] {LUNCH BAG BLACK SKULL.}             0.05309018 1115 
#> [14] {LUNCH BAG SUKI DESIGN}              0.05304257 1114 
#> [15] {HEART OF WICKER LARGE}              0.05218551 1096 
#> [16] {LOVE BUILDING BLOCK WORD}           0.05194743 1091 
#> [17] {REX CASH+CARRY JUMBO SHOPPER}       0.05118560 1075 
#> [18] {RED HANGING HEART T-LIGHT HOLDER}   0.05075707 1066 
#> [19] {JUMBO SHOPPER VINTAGE RED PAISLEY}  0.05037615 1058 
#> [20] {JUMBO BAG STRAWBERRY}               0.05032854 1057
inspect(head(sort(fis_eclat, by = "support"), 20))
#>      items                                support    count
#> [1]  {WHITE HANGING HEART T-LIGHT HOLDER} 0.15788972 3316 
#> [2]  {REGENCY CAKESTAND 3 TIER}           0.09618132 2020 
#> [3]  {STRAWBERRY CERAMIC TRINKET BOX}     0.07808780 1640 
#> [4]  {ASSORTED COLOUR BIRD ORNAMENT}      0.06727931 1413 
#> [5]  {PACK OF 72 RETRO SPOT CAKE CASES}   0.06713646 1410 
#> [6]  {60 TEATIME FAIRY CAKE CASES}        0.06361299 1336 
#> [7]  {HOME BUILDING BLOCK WORD}           0.06337492 1331 
#> [8]  {JUMBO BAG RED RETROSPOT}            0.05937530 1247 
#> [9]  {LUNCH BAG RED SPOTTY}               0.05799448 1218 
#> [10] {JUMBO STORAGE BAG SUKI}             0.05618513 1180 
#> [11] {PACK OF 60 PINK PAISLEY CAKE CASES} 0.05499476 1155 
#> [12] {WOODEN FRAME ANTIQUE WHITE}         0.05428054 1140 
#> [13] {LUNCH BAG BLACK SKULL.}             0.05309018 1115 
#> [14] {LUNCH BAG SUKI DESIGN}              0.05304257 1114 
#> [15] {HEART OF WICKER LARGE}              0.05218551 1096 
#> [16] {LOVE BUILDING BLOCK WORD}           0.05194743 1091 
#> [17] {REX CASH+CARRY JUMBO SHOPPER}       0.05118560 1075 
#> [18] {RED HANGING HEART T-LIGHT HOLDER}   0.05075707 1066 
#> [19] {JUMBO SHOPPER VINTAGE RED PAISLEY}  0.05037615 1058 
#> [20] {JUMBO BAG STRAWBERRY}               0.05032854 1057

Interpretation.
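The 20 most supported itemsets are all single items, led by WHITE HANGING HEART T-LIGHT HOLDER, which appears in 15.8% of baskets; multi-item bundles only appear further down the ranking. The identical top-20 lists from Apriori and Eclat serve as a consistency check.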

inspect(head(sort(rules, by = "lift"), 20))
#>      lhs                                      rhs                                       support confidence    coverage      lift count
#> [1]  {CAST IRON HOOK GARDEN FORK}          => {CAST IRON HOOK GARDEN TROWEL}        0.006523188  0.8726115 0.007475479 118.23604   137
#> [2]  {CAST IRON HOOK GARDEN TROWEL}        => {CAST IRON HOOK GARDEN FORK}          0.006523188  0.8838710 0.007380250 118.23604   137
#> [3]  {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE STAR DECORATION}      0.005332825  0.7516779 0.007094562 111.17421   112
#> [4]  {CHRISTMAS TREE STAR DECORATION}      => {CHRISTMAS TREE DECORATION WITH BELL} 0.005332825  0.7887324 0.006761261 111.17421   112
#> [5]  {CHRISTMAS TREE HEART DECORATION}     => {CHRISTMAS TREE STAR DECORATION}      0.005428054  0.7125000 0.007618322 105.37975   114
#> [6]  {CHRISTMAS TREE STAR DECORATION}      => {CHRISTMAS TREE HEART DECORATION}     0.005428054  0.8028169 0.006761261 105.37975   114
#> [7]  {CHRISTMAS TREE HEART DECORATION}     => {CHRISTMAS TREE DECORATION WITH BELL} 0.005332825  0.7000000 0.007618322  98.66711   112
#> [8]  {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE HEART DECORATION}     0.005332825  0.7516779 0.007094562  98.66711   112
#> [9]  {CHILDS GARDEN FORK PINK,                                                                                                        
#>       CHILDS GARDEN TROWEL BLUE,                                                                                                      
#>       CHILDS GARDEN TROWEL PINK}           => {CHILDS GARDEN FORK BLUE}             0.007427864  0.9397590 0.007904009  95.80980   156
#> [10] {CHILDS GARDEN FORK PINK,                                                                                                        
#>       CHILDS GARDEN TROWEL BLUE}           => {CHILDS GARDEN FORK BLUE}             0.007523093  0.9294118 0.008094467  94.75488   158
#> [11] {CHILDS GARDEN FORK BLUE,                                                                                                        
#>       CHILDS GARDEN TROWEL BLUE,                                                                                                      
#>       CHILDS GARDEN TROWEL PINK}           => {CHILDS GARDEN FORK PINK}             0.007427864  0.9512195 0.007808780  91.22152   156
#> [12] {CHILDS GARDEN FORK BLUE,                                                                                                        
#>       CHILDS GARDEN TROWEL PINK}           => {CHILDS GARDEN FORK PINK}             0.007523093  0.9349112 0.008046853  89.65756   158
#> [13] {BLUE FELT EASTER EGG BASKET}         => {PINK FELT EASTER EGG BASKET}         0.005904200  0.7607362 0.007761166  89.25688   124
#> [14] {PINK FELT EASTER EGG BASKET}         => {BLUE FELT EASTER EGG BASKET}         0.005904200  0.6927374 0.008522998  89.25688   124
#> [15] {CHILDRENS GARDEN GLOVES PINK,                                                                                                   
#>       CHILDS GARDEN TROWEL BLUE,                                                                                                      
#>       CHILDS GARDEN TROWEL PINK}           => {CHILDRENS GARDEN GLOVES BLUE}        0.005618513  0.9291339 0.006047043  88.69850   118
#> [16] {CHILDRENS GARDEN GLOVES PINK,                                                                                                   
#>       CHILDS GARDEN TROWEL BLUE}           => {CHILDRENS GARDEN GLOVES BLUE}        0.005618513  0.9218750 0.006094658  88.00554   118
#> [17] {PACK OF 20 SKULL PAPER NAPKINS,                                                                                                 
#>       PACK OF 6 SKULL PAPER CUPS}          => {PACK OF 6 SKULL PAPER PLATES}        0.006570803  0.9200000 0.007142177  85.49487   138
#> [18] {POPPY'S PLAYHOUSE BEDROOM,                                                                                                      
#>       POPPY'S PLAYHOUSE KITCHEN,                                                                                                      
#>       POPPY'S PLAYHOUSE LIVINGROOM}        => {POPPY'S PLAYHOUSE BATHROOM}          0.007665937  0.7815534 0.009808590  85.49054   161
#> [19] {ENVELOPE 50 ROMANTIC IMAGES}         => {ENVELOPE 50 BLOSSOM IMAGES}          0.005475669  0.6725146 0.008142082  84.07233   115
#> [20] {ENVELOPE 50 BLOSSOM IMAGES}          => {ENVELOPE 50 ROMANTIC IMAGES}         0.005475669  0.6845238 0.007999238  84.07233   115
inspect(head(sort(rules, by = "confidence"), 20))
#>      lhs                                 rhs                                support confidence    coverage     lift count
#> [1]  {CHILDRENS GARDEN GLOVES BLUE,                                                                                      
#>       CHILDRENS GARDEN GLOVES PINK,                                                                                      
#>       CHILDS GARDEN TROWEL BLUE}      => {CHILDS GARDEN TROWEL PINK}    0.005618513  1.0000000 0.005618513 78.07435   118
#> [2]  {CHILDRENS GARDEN GLOVES PINK,                                                                                      
#>       CHILDS GARDEN TROWEL BLUE}      => {CHILDS GARDEN TROWEL PINK}    0.006047043  0.9921875 0.006094658 77.46439   127
#> [3]  {CHILDS GARDEN FORK BLUE,                                                                                           
#>       CHILDS GARDEN FORK PINK,                                                                                           
#>       CHILDS GARDEN TROWEL BLUE}      => {CHILDS GARDEN TROWEL PINK}    0.007427864  0.9873418 0.007523093 77.08607   156
#> [4]  {CHILDS GARDEN FORK BLUE,                                                                                           
#>       CHILDS GARDEN FORK PINK,                                                                                           
#>       CHILDS GARDEN TROWEL PINK}      => {CHILDS GARDEN TROWEL BLUE}    0.007427864  0.9873418 0.007523093 80.68542   156
#> [5]  {CHILDS GARDEN FORK PINK,                                                                                           
#>       CHILDS GARDEN TROWEL BLUE}      => {CHILDS GARDEN TROWEL PINK}    0.007904009  0.9764706 0.008094467 76.23731   166
#> [6]  {POPPY'S PLAYHOUSE BATHROOM,                                                                                        
#>       POPPY'S PLAYHOUSE BEDROOM,                                                                                         
#>       POPPY'S PLAYHOUSE LIVINGROOM}   => {POPPY'S PLAYHOUSE KITCHEN}    0.007665937  0.9757576 0.007856395 59.92065   161
#> [7]  {CHILDS GARDEN FORK BLUE,                                                                                           
#>       CHILDS GARDEN TROWEL PINK}      => {CHILDS GARDEN TROWEL BLUE}    0.007808780  0.9704142 0.008046853 79.30210   164
#> [8]  {CHILDRENS GARDEN GLOVES BLUE,                                                                                      
#>       CHILDRENS GARDEN GLOVES PINK,                                                                                      
#>       CHILDS GARDEN TROWEL PINK}      => {CHILDS GARDEN TROWEL BLUE}    0.005618513  0.9672131 0.005808971 79.04051   118
#> [9]  {POPPY'S PLAYHOUSE BATHROOM,                                                                                        
#>       POPPY'S PLAYHOUSE LIVINGROOM}   => {POPPY'S PLAYHOUSE KITCHEN}    0.008189696  0.9662921 0.008475383 59.33938   172
#> [10] {POPPY'S PLAYHOUSE BATHROOM,                                                                                        
#>       POPPY'S PLAYHOUSE BEDROOM}      => {POPPY'S PLAYHOUSE KITCHEN}    0.007999238  0.9655172 0.008284925 59.29179   168
#> [11] {CHILDRENS GARDEN GLOVES PINK,                                                                                      
#>       CHILDS GARDEN FORK PINK}        => {CHILDS GARDEN TROWEL PINK}    0.006094658  0.9624060 0.006332730 75.13922   128
#> [12] {CHILDRENS GARDEN GLOVES BLUE,                                                                                      
#>       CHILDS GARDEN TROWEL BLUE,                                                                                         
#>       CHILDS GARDEN TROWEL PINK}      => {CHILDRENS GARDEN GLOVES PINK} 0.005618513  0.9593496 0.005856585 75.74534   118
#> [13] {POPPY'S PLAYHOUSE BATHROOM,                                                                                        
#>       POPPY'S PLAYHOUSE BEDROOM,                                                                                         
#>       POPPY'S PLAYHOUSE KITCHEN}      => {POPPY'S PLAYHOUSE LIVINGROOM} 0.007665937  0.9583333 0.007999238 77.11462   161
#> [14] {ENVELOPE 50 BLOSSOM IMAGES,                                                                                        
#>       STRAWBERRY CERAMIC TRINKET BOX} => {DOTCOM POSTAGE}               0.005237596  0.9565217 0.005475669 27.40637   110
#> [15] {CHILDRENS GARDEN GLOVES BLUE,                                                                                      
#>       CHILDS GARDEN TROWEL PINK}      => {CHILDS GARDEN TROWEL BLUE}    0.005856585  0.9534884 0.006142272 77.91892   123
#> [16] {CHILDS GARDEN FORK BLUE,                                                                                           
#>       CHILDS GARDEN TROWEL BLUE,                                                                                         
#>       CHILDS GARDEN TROWEL PINK}      => {CHILDS GARDEN FORK PINK}      0.007427864  0.9512195 0.007808780 91.22152   156
#> [17] {POPPY'S PLAYHOUSE BATHROOM,                                                                                        
#>       POPPY'S PLAYHOUSE BEDROOM}      => {POPPY'S PLAYHOUSE LIVINGROOM} 0.007856395  0.9482759 0.008284925 76.30532   165
#> [18] {POPPY'S PLAYHOUSE BATHROOM}     => {POPPY'S PLAYHOUSE KITCHEN}    0.008665841  0.9479167 0.009141986 58.21095   182
#> [19] {CHILDS GARDEN FORK BLUE,                                                                                           
#>       CHILDS GARDEN FORK PINK}        => {CHILDS GARDEN TROWEL BLUE}    0.007523093  0.9461078 0.007951624 77.31578   158
#> [20] {CHILDS GARDEN FORK BLUE,                                                                                           
#>       CHILDS GARDEN FORK PINK}        => {CHILDS GARDEN TROWEL PINK}    0.007523093  0.9461078 0.007951624 73.86675   158

0.5.4 4.4 High-Lift vs High-Confidence Rules

The top 20 rules by lift have lift values between roughly 84 and 118, with support of only about 0.5–0.8%, just above the 0.5% minimum. This typically reflects tightly coupled product pairs or variants (e.g., complementary tools or matched seasonal decorations). Although niche in absolute frequency (roughly 110 to 160 baskets each), these rules are exceptionally strong in relative terms and are well suited for targeted recommendations.

The top rules by confidence often approach 1.0, implying near-deterministic relationships. High confidence alone can simply reflect a globally popular RHS item; here the rules remain analytically meaningful because they also carry very high lift, which rules out that explanation.
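
To make the contrast concrete, the following sketch recomputes lift as the ratio of observed to expected co-occurrence for the cast-iron fork/trowel pair, using the support, coverage, and count values printed for that pair in the Section 5 top-lift listing. The total number of invoices is not printed there; it is inferred here as count / support, which is an assumption of this sketch, and all variable names are illustrative.

```r
# Values transcribed from the printed rule rows for
# {CAST IRON HOOK GARDEN FORK} <=> {CAST IRON HOOK GARDEN TROWEL}
supp_both   <- 0.006523188  # support of the pair
supp_fork   <- 0.007475479  # coverage of fork => trowel, i.e. P(fork)
supp_trowel <- 0.007380250  # coverage of trowel => fork, i.e. P(trowel)
count_both  <- 137          # invoices containing both items

# Assumption: total number of invoices inferred from count / support
n_invoices <- round(count_both / supp_both)

# Co-occurrences expected if the two items were bought independently
expected_count <- n_invoices * supp_fork * supp_trowel

# Lift is observed / expected and is identical in both rule directions
lift <- supp_both / (supp_fork * supp_trowel)

# Confidence depends on which item is the antecedent
conf_fork_to_trowel <- supp_both / supp_fork
conf_trowel_to_fork <- supp_both / supp_trowel

print(c(n_invoices = n_invoices,
        expected_count = round(expected_count, 2),
        lift = round(lift, 2)))
print(round(c(fork_to_trowel = conf_fork_to_trowel,
              trowel_to_fork = conf_trowel_to_fork), 4))
```

Under independence roughly one joint purchase would be expected, against 137 observed, which is what a lift near 118 expresses. The two directions of the rule share that lift but differ in confidence because the antecedents have different coverage.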

0.6 5) Curation: reporting-oriented rule views

To make the results presentation-friendly, we focus on:

single-item RHS (recommend one item),
high-lift bundles,
high-confidence predictors,
cross-sell patterns with both meaningful support and lift.

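# Load the final rule set from disk
rules_raw <- readRDS("rules_final.rds")

# Drop redundant rules, then keep rules that recommend exactly one item
rules_nr <- rules_raw[!is.redundant(rules_raw)]
rules_rhs1 <- subset(rules_nr, subset = size(rhs) == 1)

# Top 30 rules by lift (bundles) and by confidence (predictors)
rules_bundle <- head(sort(rules_rhs1, by = "lift"), 30)
rules_conf   <- head(sort(rules_rhs1, by = "confidence"), 30)

# Cross-sell view: support of at least 1% and lift of at least 2, top 30 by lift
rules_xsell <- rules_rhs1 %>%
  subset(subset = support >= 0.01 & lift >= 2) %>%
  sort(by = "lift") %>%
  head(30)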

write.csv(as(rules_bundle, "data.frame"), "rules_top_lift.csv", row.names = FALSE)
write.csv(as(rules_conf,   "data.frame"), "rules_top_confidence.csv", row.names = FALSE)
write.csv(as(rules_xsell,  "data.frame"), "rules_cross_sell.csv", row.names = FALSE)

cat("Rules loaded:", length(rules_raw), "\n")
#> Rules loaded: 5299
cat("Non-redundant rules:", length(rules_nr), "\n")
#> Non-redundant rules: 5299
cat("Non-redundant rules with |RHS|=1:", length(rules_rhs1), "\n")
#> Non-redundant rules with |RHS|=1: 5299

cat("\nTop 10 by lift:\n")
#> 
#> Top 10 by lift:
inspect(head(sort(rules_bundle, by = "lift"), 10))
#>      lhs                                      rhs                                       support confidence    coverage      lift count
#> [1]  {CAST IRON HOOK GARDEN FORK}          => {CAST IRON HOOK GARDEN TROWEL}        0.006523188  0.8726115 0.007475479 118.23604   137
#> [2]  {CAST IRON HOOK GARDEN TROWEL}        => {CAST IRON HOOK GARDEN FORK}          0.006523188  0.8838710 0.007380250 118.23604   137
#> [3]  {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE STAR DECORATION}      0.005332825  0.7516779 0.007094562 111.17421   112
#> [4]  {CHRISTMAS TREE STAR DECORATION}      => {CHRISTMAS TREE DECORATION WITH BELL} 0.005332825  0.7887324 0.006761261 111.17421   112
#> [5]  {CHRISTMAS TREE HEART DECORATION}     => {CHRISTMAS TREE STAR DECORATION}      0.005428054  0.7125000 0.007618322 105.37975   114
#> [6]  {CHRISTMAS TREE STAR DECORATION}      => {CHRISTMAS TREE HEART DECORATION}     0.005428054  0.8028169 0.006761261 105.37975   114
#> [7]  {CHRISTMAS TREE HEART DECORATION}     => {CHRISTMAS TREE DECORATION WITH BELL} 0.005332825  0.7000000 0.007618322  98.66711   112
#> [8]  {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE HEART DECORATION}     0.005332825  0.7516779 0.007094562  98.66711   112
#> [9]  {CHILDS GARDEN FORK PINK,                                                                                                        
#>       CHILDS GARDEN TROWEL BLUE,                                                                                                      
#>       CHILDS GARDEN TROWEL PINK}           => {CHILDS GARDEN FORK BLUE}             0.007427864  0.9397590 0.007904009  95.80980   156
#> [10] {CHILDS GARDEN FORK PINK,                                                                                                        
#>       CHILDS GARDEN TROWEL BLUE}           => {CHILDS GARDEN FORK BLUE}             0.007523093  0.9294118 0.008094467  94.75488   158

cat("\nTop 10 by confidence:\n")
#> 
#> Top 10 by confidence:
inspect(head(sort(rules_conf, by = "confidence"), 10))
#>      lhs                                rhs                             support confidence    coverage     lift count
#> [1]  {CHILDRENS GARDEN GLOVES BLUE,                                                                                  
#>       CHILDRENS GARDEN GLOVES PINK,                                                                                  
#>       CHILDS GARDEN TROWEL BLUE}     => {CHILDS GARDEN TROWEL PINK} 0.005618513  1.0000000 0.005618513 78.07435   118
#> [2]  {CHILDRENS GARDEN GLOVES PINK,                                                                                  
#>       CHILDS GARDEN TROWEL BLUE}     => {CHILDS GARDEN TROWEL PINK} 0.006047043  0.9921875 0.006094658 77.46439   127
#> [3]  {CHILDS GARDEN FORK BLUE,                                                                                       
#>       CHILDS GARDEN FORK PINK,                                                                                       
#>       CHILDS GARDEN TROWEL BLUE}     => {CHILDS GARDEN TROWEL PINK} 0.007427864  0.9873418 0.007523093 77.08607   156
#> [4]  {CHILDS GARDEN FORK BLUE,                                                                                       
#>       CHILDS GARDEN FORK PINK,                                                                                       
#>       CHILDS GARDEN TROWEL PINK}     => {CHILDS GARDEN TROWEL BLUE} 0.007427864  0.9873418 0.007523093 80.68542   156
#> [5]  {CHILDS GARDEN FORK PINK,                                                                                       
#>       CHILDS GARDEN TROWEL BLUE}     => {CHILDS GARDEN TROWEL PINK} 0.007904009  0.9764706 0.008094467 76.23731   166
#> [6]  {POPPY'S PLAYHOUSE BATHROOM,                                                                                    
#>       POPPY'S PLAYHOUSE BEDROOM,                                                                                     
#>       POPPY'S PLAYHOUSE LIVINGROOM}  => {POPPY'S PLAYHOUSE KITCHEN} 0.007665937  0.9757576 0.007856395 59.92065   161
#> [7]  {CHILDS GARDEN FORK BLUE,                                                                                       
#>       CHILDS GARDEN TROWEL PINK}     => {CHILDS GARDEN TROWEL BLUE} 0.007808780  0.9704142 0.008046853 79.30210   164
#> [8]  {CHILDRENS GARDEN GLOVES BLUE,                                                                                  
#>       CHILDRENS GARDEN GLOVES PINK,                                                                                  
#>       CHILDS GARDEN TROWEL PINK}     => {CHILDS GARDEN TROWEL BLUE} 0.005618513  0.9672131 0.005808971 79.04051   118
#> [9]  {POPPY'S PLAYHOUSE BATHROOM,                                                                                    
#>       POPPY'S PLAYHOUSE LIVINGROOM}  => {POPPY'S PLAYHOUSE KITCHEN} 0.008189696  0.9662921 0.008475383 59.33938   172
#> [10] {POPPY'S PLAYHOUSE BATHROOM,                                                                                    
#>       POPPY'S PLAYHOUSE BEDROOM}     => {POPPY'S PLAYHOUSE KITCHEN} 0.007999238  0.9655172 0.008284925 59.29179   168

cat("\nTop 10 cross-sell (support>=0.01, lift>=2):\n")
#> 
#> Top 10 cross-sell (support>=0.01, lift>=2):
inspect(head(rules_xsell, 10))
#>      lhs                                  rhs                                  support confidence   coverage     lift count
#> [1]  {CHILDS GARDEN TROWEL BLUE}       => {CHILDS GARDEN TROWEL PINK}       0.01014189  0.8287938 0.01223693 64.70753   213
#> [2]  {CHILDS GARDEN TROWEL PINK}       => {CHILDS GARDEN TROWEL BLUE}       0.01014189  0.7918216 0.01280830 64.70753   213
#> [3]  {POPPY'S PLAYHOUSE LIVINGROOM}    => {POPPY'S PLAYHOUSE BEDROOM}       0.01066565  0.8582375 0.01242739 56.68146   224
#> [4]  {POPPY'S PLAYHOUSE BEDROOM}       => {POPPY'S PLAYHOUSE LIVINGROOM}    0.01066565  0.7044025 0.01514142 56.68146   224
#> [5]  {POPPY'S PLAYHOUSE LIVINGROOM}    => {POPPY'S PLAYHOUSE KITCHEN}       0.01114180  0.8965517 0.01242739 55.05666   234
#> [6]  {POPPY'S PLAYHOUSE KITCHEN}       => {POPPY'S PLAYHOUSE LIVINGROOM}    0.01114180  0.6842105 0.01628416 55.05666   234
#> [7]  {POPPY'S PLAYHOUSE KITCHEN}       => {POPPY'S PLAYHOUSE BEDROOM}       0.01304638  0.8011696 0.01628416 52.91246   274
#> [8]  {POPPY'S PLAYHOUSE BEDROOM}       => {POPPY'S PLAYHOUSE KITCHEN}       0.01304638  0.8616352 0.01514142 52.91246   274
#> [9]  {GREEN REGENCY TEACUP AND SAUCER} => {PINK REGENCY TEACUP AND SAUCER}  0.01061804  0.6092896 0.01742691 49.21654   223
#> [10] {PINK REGENCY TEACUP AND SAUCER}  => {GREEN REGENCY TEACUP AND SAUCER} 0.01061804  0.8576923 0.01237977 49.21654   223

0.6.1 How to interpret a rule row

support is the fraction of all invoices containing both LHS and RHS,
confidence is how often RHS occurs when LHS occurs,
lift quantifies whether the rule is stronger than what you would expect from RHS popularity alone.

For cross-sell candidates, a support threshold (e.g., ≥1% of invoices, which is roughly 210 invoices here judging by the count and support columns) ensures the pattern appears at non-trivial scale.
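
The three quantities can be reproduced by hand on a handful of baskets. The sketch below uses a hypothetical toy set of five baskets (not taken from the dataset) and a small helper, `contains()`, introduced only for illustration; it is a minimal base-R reading of the definitions rather than the code that produced the tables above.

```r
# Hypothetical toy baskets (not taken from the dataset)
baskets <- list(
  c("TROWEL BLUE", "TROWEL PINK"),
  c("TROWEL BLUE", "TROWEL PINK", "TEACUP"),
  c("TEACUP"),
  c("TROWEL BLUE"),
  c("TEACUP", "SAUCER")
)

# TRUE for every basket that contains all of the given items
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

lhs <- "TROWEL BLUE"
rhs <- "TROWEL PINK"

support    <- mean(contains(c(lhs, rhs)))        # share of baskets with LHS and RHS
coverage   <- mean(contains(lhs))                # share of baskets with LHS
confidence <- support / coverage                 # P(RHS | LHS)
lift       <- confidence / mean(contains(rhs))   # confidence / P(RHS)

print(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift))
```

The same identities hold for the printed rows: for row [1] of the cross-sell listing, support divided by coverage, 0.01014189 / 0.01223693, gives the reported confidence of about 0.829.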

0.7 6) Actionable shortlist

For a compact “deployment-style” rule set, we:

keep short antecedents (size(lhs) <= 2),
keep single-item consequents (size(rhs) == 1),
remove operational artifacts (shipping / fees),
enforce balanced thresholds on support, confidence, and lift (lift is bounded both below and above).

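rules <- readRDS("rules_final.rds")

# Remove redundant rules; keep at most two antecedent items and one consequent
rules <- rules[!is.redundant(rules)]
rules <- subset(rules, subset = size(lhs) <= 2 & size(rhs) == 1)

# Drop rules that mention shipping or fee items on either side;
# the terms are collapsed into one pattern for partial matching with %pin%
drop_terms <- c("POSTAGE", "DOTCOM", "CARRIAGE")
rules <- subset(rules, subset = !(lhs %pin% paste(drop_terms, collapse="|")))
rules <- subset(rules, subset = !(rhs %pin% paste(drop_terms, collapse="|")))

# Thresholds: minimum support and confidence, lift between 1.5 and 15
rules_actionable <- subset(
  rules,
  subset = support >= 0.015 & confidence >= 0.30 & lift >= 1.5 & lift <= 15
)

# Keep the 50 strongest candidates by lift
rules_top <- head(sort(rules_actionable, by = "lift"), 50)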

write.csv(as(rules_top, "data.frame"), "rules_actionable_top50.csv", row.names = FALSE)
saveRDS(rules_top, "rules_actionable_top50.rds")

cat("Actionable candidate rules:", length(rules_actionable), "\n")
#> Actionable candidate rules: 134

inspect(head(rules_top, 20))
#>      lhs                                    rhs                                    support confidence   coverage     lift count
#> [1]  {SINGLE HEART ZINC T-LIGHT HOLDER}  => {HANGING HEART ZINC T-LIGHT HOLDER} 0.01980764  0.6265060 0.03161604 13.92368   416
#> [2]  {HANGING HEART ZINC T-LIGHT HOLDER} => {SINGLE HEART ZINC T-LIGHT HOLDER}  0.01980764  0.4402116 0.04499571 13.92368   416
#> [3]  {JUMBO BAG SCANDINAVIAN PAISLEY}    => {JUMBO BAG PINK VINTAGE PAISLEY}    0.01742691  0.4847682 0.03594896 13.15388   366
#> [4]  {JUMBO BAG PINK VINTAGE PAISLEY}    => {JUMBO BAG SCANDINAVIAN PAISLEY}    0.01742691  0.4728682 0.03685363 13.15388   366
#> [5]  {RED SPOTTY CHARLOTTE BAG}          => {STRAWBERRY CHARLOTTE BAG}          0.01552233  0.4174136 0.03718693 12.92997   326
#> [6]  {STRAWBERRY CHARLOTTE BAG}          => {RED SPOTTY CHARLOTTE BAG}          0.01552233  0.4808260 0.03228264 12.92997   326
#> [7]  {VINTAGE SNAP CARDS}                => {VINTAGE HEADS AND TAILS CARD GAME} 0.02114084  0.4435564 0.04766213 12.48736   444
#> [8]  {VINTAGE HEADS AND TAILS CARD GAME} => {VINTAGE SNAP CARDS}                0.02114084  0.5951743 0.03552043 12.48736   444
#> [9]  {WOODEN PICTURE FRAME WHITE FINISH} => {WOODEN FRAME ANTIQUE WHITE}        0.02880678  0.6388596 0.04509094 11.76959   605
#> [10] {WOODEN FRAME ANTIQUE WHITE}        => {WOODEN PICTURE FRAME WHITE FINISH} 0.02880678  0.5307018 0.05428054 11.76959   605
#> [11] {RED SPOTTY CHARLOTTE BAG}          => {WOODLAND CHARLOTTE BAG}            0.01590325  0.4276569 0.03718693 11.75609   334
#> [12] {WOODLAND CHARLOTTE BAG}            => {RED SPOTTY CHARLOTTE BAG}          0.01590325  0.4371728 0.03637749 11.75609   334
#> [13] {FELTCRAFT BUTTERFLY HEARTS}        => {FELTCRAFT 6 FLOWER FRIENDS}        0.01561756  0.4753623 0.03285401 11.46218   328
#> [14] {FELTCRAFT 6 FLOWER FRIENDS}        => {FELTCRAFT BUTTERFLY HEARTS}        0.01561756  0.3765786 0.04147224 11.46218   328
#> [15] {COOK WITH WINE METAL SIGN}         => {GIN + TONIC DIET METAL SIGN}       0.01642701  0.4713115 0.03485382 11.24828   345
#> [16] {GIN + TONIC DIET METAL SIGN}       => {COOK WITH WINE METAL SIGN}         0.01642701  0.3920455 0.04190077 11.24828   345
#> [17] {PAPER CHAIN KIT 50'S CHRISTMAS}    => {PAPER CHAIN KIT VINTAGE CHRISTMAS} 0.01623655  0.3563218 0.04556709 10.86135   341
#> [18] {PAPER CHAIN KIT VINTAGE CHRISTMAS} => {PAPER CHAIN KIT 50'S CHRISTMAS}    0.01623655  0.4949202 0.03280640 10.86135   341
#> [19] {CHOCOLATE HOT WATER BOTTLE}        => {HOT WATER BOTTLE TEA AND SYMPATHY} 0.02280735  0.5161638 0.04418627 10.66976   479
#> [20] {HOT WATER BOTTLE TEA AND SYMPATHY} => {CHOCOLATE HOT WATER BOTTLE}        0.02280735  0.4714567 0.04837635 10.66976   479

0.7.1 Actionable Rule Shortlist (Output-Driven)

Applying the operational criteria yields 134 actionable candidate rules, of which the 50 with the highest lift are exported as the shortlist. In the 20 rules listed above, support ranges from about 1.6% to 2.9%, confidence from 0.36 to 0.64, and lift from 10.7 to 13.9. This subset is designed to be deployment-oriented: frequent enough to matter, predictive enough to recommend, and not purely driven by RHS popularity.
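
The effect of the shortlist criteria can be sketched in base R on a small table of rules. The first three rows below are transcribed from the listings in Sections 5 and 6; the fourth row is hypothetical and exists only to exercise the shipping filter. The data frame and the `grepl()` call are an illustration of the same pattern and thresholds, not the arules code used above.

```r
# Rows 1-3 are transcribed from the printed listings; row 4 is hypothetical
rules_df <- data.frame(
  lhs = c("SINGLE HEART ZINC T-LIGHT HOLDER",
          "CHILDS GARDEN TROWEL BLUE",
          "CAST IRON HOOK GARDEN FORK",
          "HYPOTHETICAL ITEM"),
  rhs = c("HANGING HEART ZINC T-LIGHT HOLDER",
          "CHILDS GARDEN TROWEL PINK",
          "CAST IRON HOOK GARDEN TROWEL",
          "DOTCOM POSTAGE"),
  support    = c(0.01980764, 0.01014189, 0.006523188, 0.020),
  confidence = c(0.6265060, 0.8287938, 0.8726115, 0.90),
  lift       = c(13.92368, 64.70753, 118.23604, 10.0),
  stringsAsFactors = FALSE
)

# One regular expression that matches any of the operational terms
drop_pattern <- paste(c("POSTAGE", "DOTCOM", "CARRIAGE"), collapse = "|")
is_artifact  <- grepl(drop_pattern, rules_df$lhs) |
  grepl(drop_pattern, rules_df$rhs)

# Same thresholds as in the shortlist code above
passes <- with(rules_df,
               support >= 0.015 & confidence >= 0.30 &
                 lift >= 1.5 & lift <= 15)

actionable <- rules_df[!is_artifact & passes, ]
print(actionable[, c("lhs", "rhs", "support", "lift")])
```

Only the T-light holder rule survives: the trowel and cast-iron rules fall below the 1.5% support floor and exceed the lift cap, and the hypothetical postage rule is removed by the pattern. Because the match is partial, any product whose description happens to contain one of the terms would be dropped as well, so the term list is worth reviewing against the catalog.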

0.8 7) Visualization

We visualize the 50 shortlisted rules as:

a scatterplot of support vs confidence shaded by lift (global overview),
a graph visualization on a small subset (structure and clusters).

rules <- readRDS("rules_actionable_top50.rds")

plot(
  rules,
  method  = "scatterplot",
  measure = c("support", "confidence"),
  shading = "lift"
)

0.8.1 Global Rule Space: Support vs Confidence

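The scatterplot reveals that most rules cluster at low support values, while the highest-lift rules appear as isolated points. Because the plot is restricted to the exported shortlist, every point already satisfies the Section 6 thresholds (support ≥ 1.5%, lift ≤ 15), so low support is relative to that range. This is consistent with the empirical structure of retail baskets: strong associations tend to involve a small number of specific product pairs, and the lift shading highlights which rules are genuinely stronger than what would be expected from baseline item popularity.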

rules_net <- head(sort(rules, by = "lift"), 20)
plot(
  rules_net,
  method = "graph",
  engine = "htmlwidget"
)

0.8.2 Network Visualization of Top Rules

The graph-based visualization exposes product clusters (e.g., tightly coupled variants and seasonal bundles) and highlights hub items that frequently appear as RHS. Such hubs represent natural anchor products for recommendation placement and bundle pricing strategies.
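
A simple way to check for hubs without the plot is to count how often each item appears as a consequent. The sketch below does this in base R for the 20 highest-lift rules, with the RHS labels transcribed from the Section 6 listing; the vector and variable names are illustrative.

```r
# RHS labels of the 20 highest-lift shortlist rules, in listing order
rhs_items <- c(
  "HANGING HEART ZINC T-LIGHT HOLDER", "SINGLE HEART ZINC T-LIGHT HOLDER",
  "JUMBO BAG PINK VINTAGE PAISLEY",    "JUMBO BAG SCANDINAVIAN PAISLEY",
  "STRAWBERRY CHARLOTTE BAG",          "RED SPOTTY CHARLOTTE BAG",
  "VINTAGE HEADS AND TAILS CARD GAME", "VINTAGE SNAP CARDS",
  "WOODEN FRAME ANTIQUE WHITE",        "WOODEN PICTURE FRAME WHITE FINISH",
  "WOODLAND CHARLOTTE BAG",            "RED SPOTTY CHARLOTTE BAG",
  "FELTCRAFT 6 FLOWER FRIENDS",        "FELTCRAFT BUTTERFLY HEARTS",
  "GIN + TONIC DIET METAL SIGN",       "COOK WITH WINE METAL SIGN",
  "PAPER CHAIN KIT VINTAGE CHRISTMAS", "PAPER CHAIN KIT 50'S CHRISTMAS",
  "HOT WATER BOTTLE TEA AND SYMPATHY", "CHOCOLATE HOT WATER BOTTLE"
)

# Number of rules pointing at each item (its in-degree in the rule graph)
rhs_counts <- sort(table(rhs_items), decreasing = TRUE)

# Items that are the consequent of more than one rule
hubs <- rhs_counts[rhs_counts > 1]
print(hubs)
```

In this subset most rules form reciprocal pairs, and RED SPOTTY CHARLOTTE BAG is the only item that is the consequent of two rules (from STRAWBERRY CHARLOTTE BAG and WOODLAND CHARLOTTE BAG), which makes it the clearest hub candidate among the plotted rules.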

0.9 Summary

Based on the observed outputs, purchasing behavior in the Online Retail II dataset is strongly structured rather than random. After appropriate filtering, the extracted association rules combine non-trivial support with high confidence and lift, which makes them operationally actionable. The end-to-end pipeline, from cleaning to curated rule sets and visual diagnostics, yields interpretable, deployment-ready insights rather than purely exploratory patterns.

online_retail_ar

Tymoteusz Pawełczyk

2026-02-01