This report presents a complete, end-to-end implementation of association rules mining applied to the Online Retail II dataset. The original data are provided at the invoice-line level, where each observation corresponds to a single product item recorded on a specific invoice. Such a structure is not directly suitable for market-basket analysis, which requires transactional data.
Therefore, the raw invoice-line table is transformed into a transactional representation, where each invoice corresponds to a single basket and products are encoded as binary indicators of presence or absence. This transformation enables the identification of co-occurrence patterns between items purchased together within the same transaction.
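As a rough illustration of this transformation, the sketch below turns a tiny invoice-line table into a binary invoice-by-item matrix using base R only. The data frame `lines` and its values are hypothetical and are not taken from the dataset.

```r
# Toy invoice-line table (hypothetical data, for illustration only).
lines <- data.frame(
  Invoice     = c("A1", "A1", "A1", "B2", "B2"),
  Description = c("MUG", "MUG", "PLATE", "MUG", "SPOON"),
  stringsAsFactors = FALSE
)

# Cross-tabulate invoices against items; cells hold line counts.
counts <- table(lines$Invoice, lines$Description)

# Collapse counts to presence (1) or absence (0).
incidence <- (counts > 0) * 1L

print(incidence)
```

The repeated MUG line on invoice A1 collapses to a single indicator, which is the behaviour market-basket analysis expects.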
The analysis focuses on discovering frequent itemsets and association rules that capture systematic purchasing relationships. To evaluate the strength and relevance of the extracted rules, we rely on the standard metrics used in association rule mining: support, confidence, and lift.
Together, these measures allow for a balanced assessment of rule relevance, combining frequency, predictive accuracy, and deviation from random co-occurrence.
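The three measures can be made concrete with a small worked example. The sketch below computes them by hand for the rule BREAD => BUTTER; the baskets and the helper `support()` are hypothetical and serve only to show the arithmetic.

```r
# Hypothetical baskets used only to illustrate the three metrics.
baskets <- list(
  c("BREAD", "BUTTER"),
  c("BREAD", "BUTTER", "JAM"),
  c("BREAD"),
  c("JAM"),
  c("BUTTER", "JAM")
)

# Share of baskets that contain every item in `items`.
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

supp_rule  <- support(c("BREAD", "BUTTER"))   # frequency of the pair
confidence <- supp_rule / support("BREAD")    # P(BUTTER | BREAD)
lift       <- confidence / support("BUTTER")  # confidence relative to the BUTTER baseline

cat("support:", supp_rule, " confidence:", confidence, " lift:", lift, "\n")
```

A lift above 1 means the consequent appears more often with the antecedent than its overall frequency would suggest.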
The analysis is based on the Online Retail II dataset, publicly available from the UCI Machine Learning Repository. The dataset contains transactional data from a UK-based online retailer selling unique all-occasion giftware. It covers all invoice-line transactions recorded between 01/12/2009 and 09/12/2011, providing a large-scale and well-documented example of real-world retail purchase behavior.
Each observation corresponds to a single product line within an invoice and includes information such as the invoice identifier, stock code, product description, quantity, invoice date, unit price, customer identifier, and country of origin. The dataset has been widely used in the literature and in applied research on market-basket analysis, recommendation systems, and consumer behavior, which supports its suitability and external validity for association rule mining.
The data were obtained from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/dataset/352/online+retail
cat("R Markdown pipeline is running.\n")
#> R Markdown pipeline is running.
We keep only completed sales lines:
library(readxl)
library(dplyr)
library(stringr)

# read_excel() reads only the first sheet of the workbook by default
df <- read_excel("online_retail_II.xlsx")

df_clean <- df %>%
  filter(
    !is.na(Invoice),
    !is.na(Description),
    Quantity > 0,                             # drop negative-quantity lines
    !str_detect(as.character(Invoice), "^C")  # drop cancelled invoices (prefix "C")
  ) %>%
  mutate(
    Invoice = as.character(Invoice),
    StockCode = as.character(StockCode),
    Description = str_squish(str_to_upper(Description)),  # normalise item labels
    `Customer ID` = as.integer(`Customer ID`),
    Country = as.character(Country)
  ) %>%
  select(Invoice, StockCode, Description, Quantity, InvoiceDate, Price, `Customer ID`, Country)

saveRDS(df_clean, "online_retail_clean.rds")
These checks confirm that cleaning worked and give intuition for reasonable parameter ranges later (e.g., minimum support and maximum rule length).
cat("Rows after cleaning:", nrow(df_clean), "\n")
#> Rows after cleaning: 512033
cat("Distinct invoices (transactions):", n_distinct(df_clean$Invoice), "\n")
#> Distinct invoices (transactions): 21002
cat("Distinct items (Description):", n_distinct(df_clean$Description), "\n")
#> Distinct items (Description): 4515
After cleaning, the output indicates 512,033 sales lines, 21,002 distinct invoices (transactions), and 4,515 unique products. This implies an average of approximately 24 line items per invoice, indicating a non-trivial dataset with moderately complex baskets.
The combination of a large item universe and a relatively limited number of transactions implies a highly sparse transactional space. This motivates conservative support thresholds and explicit limits on rule length in later stages to prevent combinatorial explosion and unstable patterns.
df_clean %>%
  count(Description, sort = TRUE) %>%
  slice_head(n = 20)
#> # A tibble: 20 × 2
#> Description n
#> <chr> <int>
#> 1 WHITE HANGING HEART T-LIGHT HOLDER 3456
#> 2 REGENCY CAKESTAND 3 TIER 2046
#> 3 STRAWBERRY CERAMIC TRINKET BOX 1714
#> 4 PACK OF 72 RETRO SPOT CAKE CASES 1456
#> 5 ASSORTED COLOUR BIRD ORNAMENT 1450
#> 6 60 TEATIME FAIRY CAKE CASES 1394
#> 7 HOME BUILDING BLOCK WORD 1376
#> 8 JUMBO BAG RED RETROSPOT 1280
#> 9 LUNCH BAG RED SPOTTY 1246
#> 10 REX CASH+CARRY JUMBO SHOPPER 1226
#> 11 JUMBO STORAGE BAG SUKI 1203
#> 12 PACK OF 60 PINK PAISLEY CAKE CASES 1191
#> 13 WOODEN FRAME ANTIQUE WHITE 1169
#> 14 LUNCH BAG BLACK SKULL. 1156
#> 15 LUNCH BAG SUKI DESIGN 1146
#> 16 HEART OF WICKER LARGE 1145
#> 17 LOVE BUILDING BLOCK WORD 1129
#> 18 RED HANGING HEART T-LIGHT HOLDER 1106
#> 19 JUMBO SHOPPER VINTAGE RED PAISLEY 1085
#> 20 JUMBO BAG STRAWBERRY 1078
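The top-20 frequency table reveals strong demand concentration. The most frequent products include WHITE HANGING HEART T-LIGHT HOLDER (3,456 lines), REGENCY CAKESTAND 3 TIER (2,046), and STRAWBERRY CERAMIC TRINKET BOX (1,714).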
These items are globally popular and therefore tend to appear in many baskets. As a consequence, rules involving such items—especially on the RHS—may achieve high confidence even without strong conditional dependence. This reinforces the need to interpret confidence jointly with lift.
df_clean %>%
  count(Invoice, name = "n_lines") %>%
  summarise(
    mean_lines = mean(n_lines),
    median_lines = median(n_lines),
    p90_lines = quantile(n_lines, 0.90),
    p99_lines = quantile(n_lines, 0.99)
  )
#> # A tibble: 1 × 4
#> mean_lines median_lines p90_lines p99_lines
#> <dbl> <dbl> <dbl> <dbl>
#> 1 24.4 15 51 180
The invoice-size summary shows a heavy right tail: the median invoice contains 15 lines, the mean is 24.4, the 90th percentile is 51, and the 99th percentile reaches 180. A small number of very large baskets can therefore generate a disproportionate number of co-occurrences and inflate higher-order patterns.
This empirical structure directly justifies the later constraints on
rule length (maxlen) and the reliance on minimum support
thresholds to stabilize the mining process.
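To see why a few very large baskets matter, note that a basket with n distinct items contains n(n-1)/2 item pairs. The sketch below applies this to the invoice-size percentiles reported above. Treating line counts as distinct-item counts is a simplifying assumption, so the figures are upper bounds.

```r
# Invoice sizes taken from the summary above (lines per invoice).
sizes <- c(median = 15, p90 = 51, p99 = 180)

# Number of distinct item pairs a basket of each size can contain.
pairs <- choose(sizes, 2)

print(pairs)
```

The pair count grows quadratically with basket size, so the largest invoices contribute far more co-occurrences than a typical one.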
Association rules are typically mined on binary baskets: whether an item was present in the invoice at least once. We therefore de-duplicate (Invoice, Description) before building baskets.
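library(arules)

df_clean <- readRDS("online_retail_clean.rds")
# One row per (invoice, item) pair, then one item list per invoice
baskets <- df_clean %>%
  distinct(Invoice, Description) %>%
  group_by(Invoice) %>%
  summarise(items = list(Description), .groups = "drop")
# Coerce the list of item vectors to an arules transactions object
trans <- as(baskets$items, "transactions")
saveRDS(trans, "online_retail_transactions.rds")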
summary(trans)
#> transactions as itemMatrix in sparse format with
#> 21002 rows (elements/itemsets/transactions) and
#> 4515 columns (items) and a density of 0.005254849
#>
#> most frequent items:
#> WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER
#> 3316 2020
#> STRAWBERRY CERAMIC TRINKET BOX ASSORTED COLOUR BIRD ORNAMENT
#> 1640 1413
#> PACK OF 72 RETRO SPOT CAKE CASES (Other)
#> 1410 488487
#>
#> element (itemset/transaction) length distribution:
#> sizes
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#> 2168 877 724 683 647 618 592 551 615 541 570 561 556 525 546 517
#> 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
#> 512 527 500 491 437 400 369 323 308 294 270 260 246 237 206 179
#> 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
#> 173 162 151 153 144 130 140 153 138 112 107 102 88 86 115 80
#> 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
#> 84 84 82 76 57 56 61 58 64 55 43 49 41 45 37 42
#> 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
#> 40 28 22 28 28 32 34 22 24 27 19 17 15 15 21 15
#> 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
#> 12 11 15 18 9 9 13 14 19 15 7 7 8 8 15 7
#> 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
#> 9 8 10 12 7 4 8 5 8 6 9 9 8 7 6 4
#> 113 114 115 116 117 118 119 120 121 122 123 125 126 127 128 129
#> 8 8 14 7 8 5 5 3 9 5 5 5 4 6 3 3
#> 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
#> 5 5 6 8 7 9 4 5 5 2 2 2 4 5 2 6
#> 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
#> 4 4 6 3 2 8 1 3 3 4 4 2 5 3 1 4
#> 162 163 164 165 166 167 168 169 170 171 173 174 175 176 177 178
#> 7 2 4 7 3 3 3 4 1 4 2 3 2 3 2 2
#> 180 181 183 184 185 186 187 188 189 190 191 192 193 194 195 196
#> 3 5 3 3 1 4 1 1 3 2 2 3 3 2 2 2
#> 197 199 200 201 202 203 204 205 206 207 211 212 213 216 217 218
#> 2 2 4 4 3 1 2 1 2 2 1 1 1 3 2 4
#> 219 220 221 222 223 224 227 228 229 231 233 237 238 240 241 243
#> 2 2 2 2 2 2 1 1 1 1 1 1 1 2 2 1
#> 244 246 247 249 253 254 255 261 263 264 265 267 268 269 272 274
#> 1 1 3 1 1 2 1 1 1 2 1 1 1 1 2 1
#> 275 276 279 284 295 296 298 299 307 314 320 323 325 330 335 337
#> 2 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1
#> 340 341 343 355 362 368 369 372 376 379 384 400 406 411 416 420
#> 2 1 2 2 1 2 1 1 1 1 1 2 1 1 1 1
#> 424 427 429 435 437 438 439 441 446 448 459 460 463 465 466 476
#> 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1
#> 479 480 483 484 497 498 501 506 512 515 522 536 545 546 556 567
#> 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1
#> 576 577 578 585 588 589 595 601 647 673
#> 1 1 1 1 1 1 1 1 1 1
#>
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 6.00 15.00 23.73 28.00 673.00
#>
#> includes extended item information - examples:
#> labels
#> 1 *BOOMBOX IPOD CLASSIC
#> 2 *USB OFFICE GLITTER LAMP
#> 3 *USB OFFICE MIRROR BALL
The transaction summary confirms 21,002 transactions and 4,515 items, with matrix density 0.00525, meaning that more than 99.4% of entries are zeros. The median basket size after de-duplication remains 15 items, while the maximum reaches 673 items.
This extreme sparsity is typical for retail data and implies that most potential item combinations never occur. Consequently, very high-lift rules are expected to be rare but highly informative.
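The density figure can be reproduced from the other numbers in the summary, which is a quick way to check how it should be read. The sketch below uses only values printed in the output above; the variable names are illustrative.

```r
# Figures copied from the summary(trans) output.
n_transactions <- 21002
n_items        <- 4515

# Total item occurrences: the five most frequent items plus "(Other)".
n_ones <- 3316 + 2020 + 1640 + 1413 + 1410 + 488487

density   <- n_ones / (n_transactions * n_items)  # share of non-zero cells
mean_size <- n_ones / n_transactions              # average items per basket

cat("density:", signif(density, 7), " mean basket size:", round(mean_size, 2), "\n")
```

Density is therefore the mean basket size divided by the number of items, which explains why a large catalogue produces a very sparse matrix even when baskets are moderately large.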
itemFrequencyPlot(trans, topN = 20, type = "absolute")
Interpretation.
This plot shows the most frequent items by transaction count (true
support counts). Very frequent items can dominate high-confidence rules
(because many baskets contain them), which is why lift is important.
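The effect of a popular consequent can be shown with simple arithmetic, since lift equals confidence divided by the support of the RHS. In the sketch below the confidence of 0.20 and the niche item are hypothetical; the popular item's count comes from the summary above.

```r
# Lift of a rule X => Y equals confidence(X => Y) / support(Y).
lift_from <- function(confidence, rhs_support) confidence / rhs_support

conf <- 0.20  # hypothetical confidence, identical for both rules

# WHITE HANGING HEART T-LIGHT HOLDER appears in 3316 of 21002 baskets.
popular_rhs <- 3316 / 21002
# Hypothetical niche item present in 1% of baskets.
niche_rhs <- 0.01

lift_popular <- lift_from(conf, popular_rhs)
lift_niche   <- lift_from(conf, niche_rhs)

cat("lift with popular RHS:", round(lift_popular, 2),
    " lift with niche RHS:", round(lift_niche, 2), "\n")
```

With the same confidence, the rule pointing at the popular item is only slightly better than chance, while the rule pointing at the niche item is far stronger.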
basket_sizes <- size(trans)
summary(basket_sizes)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 6.00 15.00 23.73 28.00 673.00
Interpretation.
This is the basket size distribution after item de-duplication. If
typical baskets are small, long LHS rules will be rare and often
unstable; short LHS rules are usually the most actionable.
We mine frequent itemsets with supp = 0.01 (≥ 1% of transactions) and maxlen = 3 to keep results interpretable.
Apriori and Eclat should return the same itemsets given the same thresholds; the difference is mainly computational strategy.
For rules we use supp = 0.005 (to allow less frequent cross-sell patterns), conf = 0.20 to reduce noisy rules, and maxlen = 4 with minlen = 2.
Then we keep only positive associations (lift > 1.2) and remove redundant rules.
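Relative support thresholds are easier to judge once translated into invoice counts. The sketch below does this for the two thresholds, using the number of transactions reported after cleaning; the variable names are illustrative.

```r
n_transactions <- 21002  # distinct invoices after cleaning

# Relative minimum support for the two mining steps.
thresholds <- c(itemsets = 0.01, rules = 0.005)

# Approximate number of invoices an itemset or rule must appear in.
min_counts <- thresholds * n_transactions

print(min_counts)
```

An itemset therefore needs roughly 210 supporting invoices, and a rule roughly 105.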
trans <- readRDS("online_retail_transactions.rds")
fis_apriori <- apriori(
  trans,
  parameter = list(
    target = "frequent itemsets",
    supp = 0.01,
    maxlen = 3
  )
)
#> Apriori
#>
#> Parameter specification:
#> confidence minval smax arem aval originalSupport maxtime support minlen
#> NA 0.1 1 none FALSE TRUE 5 0.01 1
#> maxlen target ext
#> 3 frequent itemsets TRUE
#>
#> Algorithmic control:
#> filter tree heap memopt load sort verbose
#> 0.1 TRUE TRUE FALSE TRUE 2 TRUE
#>
#> Absolute minimum support count: 210
#>
#> set item appearances ...[0 item(s)] done [0.00s].
#> set transactions ...[4515 item(s), 21002 transaction(s)] done [0.14s].
#> sorting and recoding items ... [703 item(s)] done [0.01s].
#> creating transaction tree ... done [0.01s].
#> checking subsets of size 1 2 3
#> done [0.01s].
#> sorting transactions ... done [0.00s].
#> writing ... [1100 set(s)] done [0.00s].
#> creating S4 object ... done [0.00s].
fis_eclat <- eclat(
  trans,
  parameter = list(
    supp = 0.01,
    maxlen = 3
  )
)
#> Eclat
#>
#> parameter specification:
#> tidLists support minlen maxlen target ext
#> FALSE 0.01 1 3 frequent itemsets TRUE
#>
#> algorithmic control:
#> sparse sort verbose
#> 7 -2 TRUE
#>
#> Absolute minimum support count: 210
#>
#> create itemset ...
#> set transactions ...[4515 item(s), 21002 transaction(s)] done [0.11s].
#> sorting and recoding items ... [703 item(s)] done [0.01s].
#> creating sparse bit matrix ... [703 row(s), 21002 column(s)] done [0.00s].
#> writing ... [1100 set(s)] done [0.46s].
#> Creating S4 object ... done [0.00s].
rules_raw <- apriori(
  trans,
  parameter = list(
    supp = 0.005,
    conf = 0.20,
    minlen = 2,
    maxlen = 4
  )
)
#> Apriori
#>
#> Parameter specification:
#> confidence minval smax arem aval originalSupport maxtime support minlen
#> 0.2 0.1 1 none FALSE TRUE 5 0.005 2
#> maxlen target ext
#> 4 rules TRUE
#>
#> Algorithmic control:
#> filter tree heap memopt load sort verbose
#> 0.1 TRUE TRUE FALSE TRUE 2 TRUE
#>
#> Absolute minimum support count: 105
#>
#> set item appearances ...[0 item(s)] done [0.00s].
#> set transactions ...[4515 item(s), 21002 transaction(s)] done [0.12s].
#> sorting and recoding items ... [1383 item(s)] done [0.01s].
#> creating transaction tree ... done [0.00s].
#> checking subsets of size 1 2 3 4
#> done [0.05s].
#> writing ... [5365 rule(s)] done [0.00s].
#> creating S4 object ... done [0.00s].
rules <- subset(rules_raw, subset = lift > 1.2)
rules <- rules[!is.redundant(rules)]
saveRDS(fis_apriori, "fis_apriori.rds")
saveRDS(fis_eclat, "fis_eclat.rds")
saveRDS(rules, "rules_final.rds")
cat("Frequent itemsets (Apriori):", length(fis_apriori), "\n")
#> Frequent itemsets (Apriori): 1100
cat("Frequent itemsets (Eclat):", length(fis_eclat), "\n")
#> Frequent itemsets (Eclat): 1100
cat("Rules (filtered & non-redundant):", length(rules), "\n")
#> Rules (filtered & non-redundant): 5299
Using identical parameters (supp = 0.01, maxlen = 3), both Apriori and Eclat return the same number of frequent itemsets (1,100 in the output). Both algorithms enumerate exactly the itemsets that meet the support threshold, so this agreement is expected; it works as a sanity check on the pipeline rather than as independent evidence that the patterns are genuine.
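Equal counts do not by themselves show that the two result sets contain the same itemsets. A minimal sketch of a stricter check follows, assuming the arules package is installed. It uses the Groceries data shipped with arules as a stand-in for the retail transactions, and the object names are illustrative.

```r
library(arules)

# Groceries ships with arules; it stands in for the retail transactions here.
data("Groceries")

params <- list(supp = 0.01, maxlen = 3)

fis_a <- apriori(
  Groceries,
  parameter = c(params, list(target = "frequent itemsets")),
  control   = list(verbose = FALSE)
)
fis_e <- eclat(Groceries, parameter = params, control = list(verbose = FALSE))

# Compare the itemsets themselves, not just how many there are.
same_itemsets <- setequal(labels(fis_a), labels(fis_e))

print(same_itemsets)
```

The same comparison could be applied to `fis_apriori` and `fis_eclat` from the pipeline.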
inspect(head(sort(fis_apriori, by = "support"), 20))
#> items support count
#> [1] {WHITE HANGING HEART T-LIGHT HOLDER} 0.15788972 3316
#> [2] {REGENCY CAKESTAND 3 TIER} 0.09618132 2020
#> [3] {STRAWBERRY CERAMIC TRINKET BOX} 0.07808780 1640
#> [4] {ASSORTED COLOUR BIRD ORNAMENT} 0.06727931 1413
#> [5] {PACK OF 72 RETRO SPOT CAKE CASES} 0.06713646 1410
#> [6] {60 TEATIME FAIRY CAKE CASES} 0.06361299 1336
#> [7] {HOME BUILDING BLOCK WORD} 0.06337492 1331
#> [8] {JUMBO BAG RED RETROSPOT} 0.05937530 1247
#> [9] {LUNCH BAG RED SPOTTY} 0.05799448 1218
#> [10] {JUMBO STORAGE BAG SUKI} 0.05618513 1180
#> [11] {PACK OF 60 PINK PAISLEY CAKE CASES} 0.05499476 1155
#> [12] {WOODEN FRAME ANTIQUE WHITE} 0.05428054 1140
#> [13] {LUNCH BAG BLACK SKULL.} 0.05309018 1115
#> [14] {LUNCH BAG SUKI DESIGN} 0.05304257 1114
#> [15] {HEART OF WICKER LARGE} 0.05218551 1096
#> [16] {LOVE BUILDING BLOCK WORD} 0.05194743 1091
#> [17] {REX CASH+CARRY JUMBO SHOPPER} 0.05118560 1075
#> [18] {RED HANGING HEART T-LIGHT HOLDER} 0.05075707 1066
#> [19] {JUMBO SHOPPER VINTAGE RED PAISLEY} 0.05037615 1058
#> [20] {JUMBO BAG STRAWBERRY} 0.05032854 1057
inspect(head(sort(fis_eclat, by = "support"), 20))
#> items support count
#> [1] {WHITE HANGING HEART T-LIGHT HOLDER} 0.15788972 3316
#> [2] {REGENCY CAKESTAND 3 TIER} 0.09618132 2020
#> [3] {STRAWBERRY CERAMIC TRINKET BOX} 0.07808780 1640
#> [4] {ASSORTED COLOUR BIRD ORNAMENT} 0.06727931 1413
#> [5] {PACK OF 72 RETRO SPOT CAKE CASES} 0.06713646 1410
#> [6] {60 TEATIME FAIRY CAKE CASES} 0.06361299 1336
#> [7] {HOME BUILDING BLOCK WORD} 0.06337492 1331
#> [8] {JUMBO BAG RED RETROSPOT} 0.05937530 1247
#> [9] {LUNCH BAG RED SPOTTY} 0.05799448 1218
#> [10] {JUMBO STORAGE BAG SUKI} 0.05618513 1180
#> [11] {PACK OF 60 PINK PAISLEY CAKE CASES} 0.05499476 1155
#> [12] {WOODEN FRAME ANTIQUE WHITE} 0.05428054 1140
#> [13] {LUNCH BAG BLACK SKULL.} 0.05309018 1115
#> [14] {LUNCH BAG SUKI DESIGN} 0.05304257 1114
#> [15] {HEART OF WICKER LARGE} 0.05218551 1096
#> [16] {LOVE BUILDING BLOCK WORD} 0.05194743 1091
#> [17] {REX CASH+CARRY JUMBO SHOPPER} 0.05118560 1075
#> [18] {RED HANGING HEART T-LIGHT HOLDER} 0.05075707 1066
#> [19] {JUMBO SHOPPER VINTAGE RED PAISLEY} 0.05037615 1058
#> [20] {JUMBO BAG STRAWBERRY} 0.05032854 1057
Interpretation.
The most supported itemsets are the core single products; multi-item bundles appear further down the ranking. Identical top results across Apriori and Eclat are a useful consistency check.
inspect(head(sort(rules, by = "lift"), 20))
#> lhs rhs support confidence coverage lift count
#> [1] {CAST IRON HOOK GARDEN FORK} => {CAST IRON HOOK GARDEN TROWEL} 0.006523188 0.8726115 0.007475479 118.23604 137
#> [2] {CAST IRON HOOK GARDEN TROWEL} => {CAST IRON HOOK GARDEN FORK} 0.006523188 0.8838710 0.007380250 118.23604 137
#> [3] {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE STAR DECORATION} 0.005332825 0.7516779 0.007094562 111.17421 112
#> [4] {CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005332825 0.7887324 0.006761261 111.17421 112
#> [5] {CHRISTMAS TREE HEART DECORATION} => {CHRISTMAS TREE STAR DECORATION} 0.005428054 0.7125000 0.007618322 105.37975 114
#> [6] {CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE HEART DECORATION} 0.005428054 0.8028169 0.006761261 105.37975 114
#> [7] {CHRISTMAS TREE HEART DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005332825 0.7000000 0.007618322 98.66711 112
#> [8] {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE HEART DECORATION} 0.005332825 0.7516779 0.007094562 98.66711 112
#> [9] {CHILDS GARDEN FORK PINK,
#> CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN FORK BLUE} 0.007427864 0.9397590 0.007904009 95.80980 156
#> [10] {CHILDS GARDEN FORK PINK,
#> CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN FORK BLUE} 0.007523093 0.9294118 0.008094467 94.75488 158
#> [11] {CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN FORK PINK} 0.007427864 0.9512195 0.007808780 91.22152 156
#> [12] {CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN FORK PINK} 0.007523093 0.9349112 0.008046853 89.65756 158
#> [13] {BLUE FELT EASTER EGG BASKET} => {PINK FELT EASTER EGG BASKET} 0.005904200 0.7607362 0.007761166 89.25688 124
#> [14] {PINK FELT EASTER EGG BASKET} => {BLUE FELT EASTER EGG BASKET} 0.005904200 0.6927374 0.008522998 89.25688 124
#> [15] {CHILDRENS GARDEN GLOVES PINK,
#> CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN TROWEL PINK} => {CHILDRENS GARDEN GLOVES BLUE} 0.005618513 0.9291339 0.006047043 88.69850 118
#> [16] {CHILDRENS GARDEN GLOVES PINK,
#> CHILDS GARDEN TROWEL BLUE} => {CHILDRENS GARDEN GLOVES BLUE} 0.005618513 0.9218750 0.006094658 88.00554 118
#> [17] {PACK OF 20 SKULL PAPER NAPKINS,
#> PACK OF 6 SKULL PAPER CUPS} => {PACK OF 6 SKULL PAPER PLATES} 0.006570803 0.9200000 0.007142177 85.49487 138
#> [18] {POPPY'S PLAYHOUSE BEDROOM,
#> POPPY'S PLAYHOUSE KITCHEN,
#> POPPY'S PLAYHOUSE LIVINGROOM} => {POPPY'S PLAYHOUSE BATHROOM} 0.007665937 0.7815534 0.009808590 85.49054 161
#> [19] {ENVELOPE 50 ROMANTIC IMAGES} => {ENVELOPE 50 BLOSSOM IMAGES} 0.005475669 0.6725146 0.008142082 84.07233 115
#> [20] {ENVELOPE 50 BLOSSOM IMAGES} => {ENVELOPE 50 ROMANTIC IMAGES} 0.005475669 0.6845238 0.007999238 84.07233 115
inspect(head(sort(rules, by = "confidence"), 20))
#> lhs rhs support confidence coverage lift count
#> [1] {CHILDRENS GARDEN GLOVES BLUE,
#> CHILDRENS GARDEN GLOVES PINK,
#> CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN TROWEL PINK} 0.005618513 1.0000000 0.005618513 78.07435 118
#> [2] {CHILDRENS GARDEN GLOVES PINK,
#> CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN TROWEL PINK} 0.006047043 0.9921875 0.006094658 77.46439 127
#> [3] {CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN FORK PINK,
#> CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN TROWEL PINK} 0.007427864 0.9873418 0.007523093 77.08607 156
#> [4] {CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN FORK PINK,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN TROWEL BLUE} 0.007427864 0.9873418 0.007523093 80.68542 156
#> [5] {CHILDS GARDEN FORK PINK,
#> CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN TROWEL PINK} 0.007904009 0.9764706 0.008094467 76.23731 166
#> [6] {POPPY'S PLAYHOUSE BATHROOM,
#> POPPY'S PLAYHOUSE BEDROOM,
#> POPPY'S PLAYHOUSE LIVINGROOM} => {POPPY'S PLAYHOUSE KITCHEN} 0.007665937 0.9757576 0.007856395 59.92065 161
#> [7] {CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN TROWEL BLUE} 0.007808780 0.9704142 0.008046853 79.30210 164
#> [8] {CHILDRENS GARDEN GLOVES BLUE,
#> CHILDRENS GARDEN GLOVES PINK,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN TROWEL BLUE} 0.005618513 0.9672131 0.005808971 79.04051 118
#> [9] {POPPY'S PLAYHOUSE BATHROOM,
#> POPPY'S PLAYHOUSE LIVINGROOM} => {POPPY'S PLAYHOUSE KITCHEN} 0.008189696 0.9662921 0.008475383 59.33938 172
#> [10] {POPPY'S PLAYHOUSE BATHROOM,
#> POPPY'S PLAYHOUSE BEDROOM} => {POPPY'S PLAYHOUSE KITCHEN} 0.007999238 0.9655172 0.008284925 59.29179 168
#> [11] {CHILDRENS GARDEN GLOVES PINK,
#> CHILDS GARDEN FORK PINK} => {CHILDS GARDEN TROWEL PINK} 0.006094658 0.9624060 0.006332730 75.13922 128
#> [12] {CHILDRENS GARDEN GLOVES BLUE,
#> CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN TROWEL PINK} => {CHILDRENS GARDEN GLOVES PINK} 0.005618513 0.9593496 0.005856585 75.74534 118
#> [13] {POPPY'S PLAYHOUSE BATHROOM,
#> POPPY'S PLAYHOUSE BEDROOM,
#> POPPY'S PLAYHOUSE KITCHEN} => {POPPY'S PLAYHOUSE LIVINGROOM} 0.007665937 0.9583333 0.007999238 77.11462 161
#> [14] {ENVELOPE 50 BLOSSOM IMAGES,
#> STRAWBERRY CERAMIC TRINKET BOX} => {DOTCOM POSTAGE} 0.005237596 0.9565217 0.005475669 27.40637 110
#> [15] {CHILDRENS GARDEN GLOVES BLUE,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN TROWEL BLUE} 0.005856585 0.9534884 0.006142272 77.91892 123
#> [16] {CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN FORK PINK} 0.007427864 0.9512195 0.007808780 91.22152 156
#> [17] {POPPY'S PLAYHOUSE BATHROOM,
#> POPPY'S PLAYHOUSE BEDROOM} => {POPPY'S PLAYHOUSE LIVINGROOM} 0.007856395 0.9482759 0.008284925 76.30532 165
#> [18] {POPPY'S PLAYHOUSE BATHROOM} => {POPPY'S PLAYHOUSE KITCHEN} 0.008665841 0.9479167 0.009141986 58.21095 182
#> [19] {CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN FORK PINK} => {CHILDS GARDEN TROWEL BLUE} 0.007523093 0.9461078 0.007951624 77.31578 158
#> [20] {CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN FORK PINK} => {CHILDS GARDEN TROWEL PINK} 0.007523093 0.9461078 0.007951624 73.86675 158
The top rules by lift show extremely high lift values (roughly 84 to 118 in the top 20) at low support (about 0.5–0.8%, just above the 0.5% mining threshold). This typically reflects tightly coupled product pairs or variants (e.g., complementary tools or matched seasonal decorations). Although niche in absolute frequency, these rules are exceptionally strong in relative terms and are well suited for targeted recommendations.
Rules ranked by confidence often approach 1.0, implying near-deterministic relationships. These remain analytically meaningful in the output because they are accompanied by very high lift, which confirms that the patterns are not simply driven by globally popular RHS items.
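The columns of the rule tables are linked by simple identities, which makes individual rules easy to audit. The sketch below recomputes confidence and lift for rule [1] of the lift ranking from values printed in the output; the variable names are illustrative.

```r
n <- 21002  # number of transactions

# Rule [1]: {CAST IRON HOOK GARDEN FORK} => {CAST IRON HOOK GARDEN TROWEL}
count_both <- 137          # baskets containing both items
supp_lhs   <- 0.007475479  # coverage of rule [1], i.e. support of the fork
supp_rhs   <- 0.007380250  # coverage of rule [2], i.e. support of the trowel

supp_both  <- count_both / n
confidence <- supp_both / supp_lhs
lift       <- confidence / supp_rhs

cat("confidence:", round(confidence, 4), " lift:", round(lift, 2), "\n")
```

Both items are rare on their own, so even 137 joint purchases are enough to push the lift above 100.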
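To make results presentation-friendly we focus on single-consequent rules and export three views: the top 30 rules by lift, the top 30 by confidence, and cross-sell candidates (support ≥ 0.01 and lift ≥ 2):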
rules_raw <- readRDS("rules_final.rds")
rules_nr <- rules_raw[!is.redundant(rules_raw)]
rules_rhs1 <- subset(rules_nr, subset = size(rhs) == 1)
rules_bundle <- head(sort(rules_rhs1, by = "lift"), 30)
rules_conf <- head(sort(rules_rhs1, by = "confidence"), 30)
rules_xsell <- rules_rhs1 %>%
  subset(subset = support >= 0.01 & lift >= 2) %>%
  sort(by = "lift") %>%
  head(30)
write.csv(as(rules_bundle, "data.frame"), "rules_top_lift.csv", row.names = FALSE)
write.csv(as(rules_conf, "data.frame"), "rules_top_confidence.csv", row.names = FALSE)
write.csv(as(rules_xsell, "data.frame"), "rules_cross_sell.csv", row.names = FALSE)
cat("Rules loaded:", length(rules_raw), "\n")
#> Rules loaded: 5299
cat("Non-redundant rules:", length(rules_nr), "\n")
#> Non-redundant rules: 5299
cat("Non-redundant rules with |RHS|=1:", length(rules_rhs1), "\n")
#> Non-redundant rules with |RHS|=1: 5299
cat("\nTop 10 by lift:\n")
#>
#> Top 10 by lift:
inspect(head(sort(rules_bundle, by = "lift"), 10))
#> lhs rhs support confidence coverage lift count
#> [1] {CAST IRON HOOK GARDEN FORK} => {CAST IRON HOOK GARDEN TROWEL} 0.006523188 0.8726115 0.007475479 118.23604 137
#> [2] {CAST IRON HOOK GARDEN TROWEL} => {CAST IRON HOOK GARDEN FORK} 0.006523188 0.8838710 0.007380250 118.23604 137
#> [3] {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE STAR DECORATION} 0.005332825 0.7516779 0.007094562 111.17421 112
#> [4] {CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005332825 0.7887324 0.006761261 111.17421 112
#> [5] {CHRISTMAS TREE HEART DECORATION} => {CHRISTMAS TREE STAR DECORATION} 0.005428054 0.7125000 0.007618322 105.37975 114
#> [6] {CHRISTMAS TREE STAR DECORATION} => {CHRISTMAS TREE HEART DECORATION} 0.005428054 0.8028169 0.006761261 105.37975 114
#> [7] {CHRISTMAS TREE HEART DECORATION} => {CHRISTMAS TREE DECORATION WITH BELL} 0.005332825 0.7000000 0.007618322 98.66711 112
#> [8] {CHRISTMAS TREE DECORATION WITH BELL} => {CHRISTMAS TREE HEART DECORATION} 0.005332825 0.7516779 0.007094562 98.66711 112
#> [9] {CHILDS GARDEN FORK PINK,
#> CHILDS GARDEN TROWEL BLUE,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN FORK BLUE} 0.007427864 0.9397590 0.007904009 95.80980 156
#> [10] {CHILDS GARDEN FORK PINK,
#> CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN FORK BLUE} 0.007523093 0.9294118 0.008094467 94.75488 158
cat("\nTop 10 by confidence:\n")
#>
#> Top 10 by confidence:
inspect(head(sort(rules_conf, by = "confidence"), 10))
#> lhs rhs support confidence coverage lift count
#> [1] {CHILDRENS GARDEN GLOVES BLUE,
#> CHILDRENS GARDEN GLOVES PINK,
#> CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN TROWEL PINK} 0.005618513 1.0000000 0.005618513 78.07435 118
#> [2] {CHILDRENS GARDEN GLOVES PINK,
#> CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN TROWEL PINK} 0.006047043 0.9921875 0.006094658 77.46439 127
#> [3] {CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN FORK PINK,
#> CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN TROWEL PINK} 0.007427864 0.9873418 0.007523093 77.08607 156
#> [4] {CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN FORK PINK,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN TROWEL BLUE} 0.007427864 0.9873418 0.007523093 80.68542 156
#> [5] {CHILDS GARDEN FORK PINK,
#> CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN TROWEL PINK} 0.007904009 0.9764706 0.008094467 76.23731 166
#> [6] {POPPY'S PLAYHOUSE BATHROOM,
#> POPPY'S PLAYHOUSE BEDROOM,
#> POPPY'S PLAYHOUSE LIVINGROOM} => {POPPY'S PLAYHOUSE KITCHEN} 0.007665937 0.9757576 0.007856395 59.92065 161
#> [7] {CHILDS GARDEN FORK BLUE,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN TROWEL BLUE} 0.007808780 0.9704142 0.008046853 79.30210 164
#> [8] {CHILDRENS GARDEN GLOVES BLUE,
#> CHILDRENS GARDEN GLOVES PINK,
#> CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN TROWEL BLUE} 0.005618513 0.9672131 0.005808971 79.04051 118
#> [9] {POPPY'S PLAYHOUSE BATHROOM,
#> POPPY'S PLAYHOUSE LIVINGROOM} => {POPPY'S PLAYHOUSE KITCHEN} 0.008189696 0.9662921 0.008475383 59.33938 172
#> [10] {POPPY'S PLAYHOUSE BATHROOM,
#> POPPY'S PLAYHOUSE BEDROOM} => {POPPY'S PLAYHOUSE KITCHEN} 0.007999238 0.9655172 0.008284925 59.29179 168
cat("\nTop 10 cross-sell (support>=0.01, lift>=2):\n")
#>
#> Top 10 cross-sell (support>=0.01, lift>=2):
inspect(head(rules_xsell, 10))
#> lhs rhs support confidence coverage lift count
#> [1] {CHILDS GARDEN TROWEL BLUE} => {CHILDS GARDEN TROWEL PINK} 0.01014189 0.8287938 0.01223693 64.70753 213
#> [2] {CHILDS GARDEN TROWEL PINK} => {CHILDS GARDEN TROWEL BLUE} 0.01014189 0.7918216 0.01280830 64.70753 213
#> [3] {POPPY'S PLAYHOUSE LIVINGROOM} => {POPPY'S PLAYHOUSE BEDROOM} 0.01066565 0.8582375 0.01242739 56.68146 224
#> [4] {POPPY'S PLAYHOUSE BEDROOM} => {POPPY'S PLAYHOUSE LIVINGROOM} 0.01066565 0.7044025 0.01514142 56.68146 224
#> [5] {POPPY'S PLAYHOUSE LIVINGROOM} => {POPPY'S PLAYHOUSE KITCHEN} 0.01114180 0.8965517 0.01242739 55.05666 234
#> [6] {POPPY'S PLAYHOUSE KITCHEN} => {POPPY'S PLAYHOUSE LIVINGROOM} 0.01114180 0.6842105 0.01628416 55.05666 234
#> [7] {POPPY'S PLAYHOUSE KITCHEN} => {POPPY'S PLAYHOUSE BEDROOM} 0.01304638 0.8011696 0.01628416 52.91246 274
#> [8] {POPPY'S PLAYHOUSE BEDROOM} => {POPPY'S PLAYHOUSE KITCHEN} 0.01304638 0.8616352 0.01514142 52.91246 274
#> [9] {GREEN REGENCY TEACUP AND SAUCER} => {PINK REGENCY TEACUP AND SAUCER} 0.01061804 0.6092896 0.01742691 49.21654 223
#> [10] {PINK REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.01061804 0.8576923 0.01237977 49.21654 223
For cross-sell candidates, a support threshold (e.g., ≥1%) ensures the pattern appears at non-trivial scale.
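For a compact “deployment-style” rule set, we keep rules with short antecedents (size(lhs) <= 2) and a single-item consequent (size(rhs) == 1), drop rules that involve service items such as postage or carriage, and apply thresholds on support, confidence, and lift:
rules <- readRDS("rules_final.rds")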
rules <- rules[!is.redundant(rules)]
rules <- subset(rules, subset = size(lhs) <= 2 & size(rhs) == 1)
drop_terms <- c("POSTAGE", "DOTCOM", "CARRIAGE")
rules <- subset(rules, subset = !(lhs %pin% paste(drop_terms, collapse="|")))
rules <- subset(rules, subset = !(rhs %pin% paste(drop_terms, collapse="|")))
rules_actionable <- subset(
  rules,
  subset = support >= 0.015 & confidence >= 0.30 & lift >= 1.5 & lift <= 15
)
rules_top <- head(sort(rules_actionable, by = "lift"), 50)
write.csv(as(rules_top, "data.frame"), "rules_actionable_top50.csv", row.names = FALSE)
saveRDS(rules_top, "rules_actionable_top50.rds")
cat("Actionable candidate rules:", length(rules_actionable), "\n")
#> Actionable candidate rules: 134
inspect(head(rules_top, 20))
#> lhs rhs support confidence coverage lift count
#> [1] {SINGLE HEART ZINC T-LIGHT HOLDER} => {HANGING HEART ZINC T-LIGHT HOLDER} 0.01980764 0.6265060 0.03161604 13.92368 416
#> [2] {HANGING HEART ZINC T-LIGHT HOLDER} => {SINGLE HEART ZINC T-LIGHT HOLDER} 0.01980764 0.4402116 0.04499571 13.92368 416
#> [3] {JUMBO BAG SCANDINAVIAN PAISLEY} => {JUMBO BAG PINK VINTAGE PAISLEY} 0.01742691 0.4847682 0.03594896 13.15388 366
#> [4] {JUMBO BAG PINK VINTAGE PAISLEY} => {JUMBO BAG SCANDINAVIAN PAISLEY} 0.01742691 0.4728682 0.03685363 13.15388 366
#> [5] {RED SPOTTY CHARLOTTE BAG} => {STRAWBERRY CHARLOTTE BAG} 0.01552233 0.4174136 0.03718693 12.92997 326
#> [6] {STRAWBERRY CHARLOTTE BAG} => {RED SPOTTY CHARLOTTE BAG} 0.01552233 0.4808260 0.03228264 12.92997 326
#> [7] {VINTAGE SNAP CARDS} => {VINTAGE HEADS AND TAILS CARD GAME} 0.02114084 0.4435564 0.04766213 12.48736 444
#> [8] {VINTAGE HEADS AND TAILS CARD GAME} => {VINTAGE SNAP CARDS} 0.02114084 0.5951743 0.03552043 12.48736 444
#> [9] {WOODEN PICTURE FRAME WHITE FINISH} => {WOODEN FRAME ANTIQUE WHITE} 0.02880678 0.6388596 0.04509094 11.76959 605
#> [10] {WOODEN FRAME ANTIQUE WHITE} => {WOODEN PICTURE FRAME WHITE FINISH} 0.02880678 0.5307018 0.05428054 11.76959 605
#> [11] {RED SPOTTY CHARLOTTE BAG} => {WOODLAND CHARLOTTE BAG} 0.01590325 0.4276569 0.03718693 11.75609 334
#> [12] {WOODLAND CHARLOTTE BAG} => {RED SPOTTY CHARLOTTE BAG} 0.01590325 0.4371728 0.03637749 11.75609 334
#> [13] {FELTCRAFT BUTTERFLY HEARTS} => {FELTCRAFT 6 FLOWER FRIENDS} 0.01561756 0.4753623 0.03285401 11.46218 328
#> [14] {FELTCRAFT 6 FLOWER FRIENDS} => {FELTCRAFT BUTTERFLY HEARTS} 0.01561756 0.3765786 0.04147224 11.46218 328
#> [15] {COOK WITH WINE METAL SIGN} => {GIN + TONIC DIET METAL SIGN} 0.01642701 0.4713115 0.03485382 11.24828 345
#> [16] {GIN + TONIC DIET METAL SIGN} => {COOK WITH WINE METAL SIGN} 0.01642701 0.3920455 0.04190077 11.24828 345
#> [17] {PAPER CHAIN KIT 50'S CHRISTMAS} => {PAPER CHAIN KIT VINTAGE CHRISTMAS} 0.01623655 0.3563218 0.04556709 10.86135 341
#> [18] {PAPER CHAIN KIT VINTAGE CHRISTMAS} => {PAPER CHAIN KIT 50'S CHRISTMAS} 0.01623655 0.4949202 0.03280640 10.86135 341
#> [19] {CHOCOLATE HOT WATER BOTTLE} => {HOT WATER BOTTLE TEA AND SYMPATHY} 0.02280735 0.5161638 0.04418627 10.66976 479
#> [20] {HOT WATER BOTTLE TEA AND SYMPATHY} => {CHOCOLATE HOT WATER BOTTLE} 0.02280735 0.4714567 0.04837635 10.66976 479
Applying the operational criteria yields 134 actionable candidate rules in the output. The strongest rules in this shortlist typically balance non-trivial support with high relative strength (lift). This subset is designed to be deployment-oriented: frequent enough to matter, predictive enough to recommend, and not purely driven by RHS popularity.
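To illustrate how the thresholds separate rule types, the sketch below applies the same criteria to three rules copied from the outputs above, with values rounded. The data frame `rules_df` is an illustrative stand-in for the arules object.

```r
# Three rules copied from the outputs above (values rounded).
rules_df <- data.frame(
  rule = c(
    "CAST IRON HOOK GARDEN FORK => CAST IRON HOOK GARDEN TROWEL",
    "CHILDS GARDEN TROWEL BLUE => CHILDS GARDEN TROWEL PINK",
    "SINGLE HEART ZINC T-LIGHT HOLDER => HANGING HEART ZINC T-LIGHT HOLDER"
  ),
  support    = c(0.0065, 0.0101, 0.0198),
  confidence = c(0.873, 0.829, 0.627),
  lift       = c(118.2, 64.7, 13.9)
)

# Same thresholds as the actionable filter.
rules_df$actionable <- with(
  rules_df,
  support >= 0.015 & confidence >= 0.30 & lift >= 1.5 & lift <= 15
)

print(rules_df[, c("support", "confidence", "lift", "actionable")])
```

The two highest-lift rules are excluded because their support is too low and their lift exceeds the cap, whereas the zinc t-light pair meets all four criteria.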
We visualize the final rules as a support-confidence scatterplot shaded by lift and as a network graph of the top 20 rules by lift:
library(arulesViz)

rules <- readRDS("rules_actionable_top50.rds")
plot(
  rules,
  method = "scatterplot",
  measure = c("support", "confidence"),
  shading = "lift"
)
The scatterplot reveals that most rules cluster at low support values, while the highest-lift rules appear as isolated points. This is consistent with the empirical structure of retail baskets: strong associations tend to be localized, and the lift shading highlights which rules are genuinely stronger than what would be expected from baseline item popularity.
rules_net <- head(sort(rules, by = "lift"), 20)
plot(
  rules_net,
  method = "graph",
  engine = "htmlwidget"
)
The graph-based visualization exposes product clusters (e.g., tightly coupled variants and seasonal bundles) and highlights hub items that frequently appear as RHS. Such hubs represent natural anchor products for recommendation placement and bundle pricing strategies.
Based on the observed outputs, purchasing behavior in the Online Retail II dataset is strongly structured rather than random. After appropriate filtering, the extracted association rules combine non-trivial support with high confidence and lift, which makes them operationally actionable. The end-to-end pipeline, from cleaning to curated rule sets and visual diagnostics, yields interpretable, deployment-oriented insights rather than purely exploratory patterns.