Association rule mining (market-basket analysis) discovers co-occurrence patterns in transactional data. Typical business applications include:
- product bundling and cross-selling,
- store layout optimization,
- recommendation systems,
- identifying complementary vs. substitute items.
Which item combinations and association rules are most informative in the Groceries transaction dataset, and how do support/confidence/lift trade off under different threshold choices?
We use the classic Groceries dataset from the arules package:
- 9,835 transactions (shopping baskets),
- 169 unique items.
This dataset is widely used for teaching association rules because it is large enough to be realistic but still manageable.
# This report relies on a few CRAN packages. If they are missing, we install them once.
required_pkgs <- c("arules", "ggplot2", "dplyr", "tidyr")
missing <- required_pkgs[!vapply(required_pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1))]
if (length(missing) > 0) {
  message("Installing missing packages: ", paste(missing, collapse = ", "))
  install.packages(missing, repos = "https://cloud.r-project.org")
}
# Load libraries (fail fast if installation was blocked)
invisible(lapply(required_pkgs, library, character.only = TRUE))
# Optional: nicer visualizations (not required). We only use it if installed.
has_arulesviz <- requireNamespace("arulesViz", quietly = TRUE)
if (has_arulesviz) library(arulesViz)
data("Groceries") # loads an object named Groceries
Groceries
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.0261
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    2.00    3.00    4.41    6.00   32.00
##
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
n_trans <- length(Groceries)
n_items <- length(itemLabels(Groceries))
# Total number of item occurrences across all transactions
n_nonzero <- sum(size(Groceries))
# Transaction matrix density = nonzero / (n_trans * n_items)
dens <- n_nonzero / (n_trans * n_items)
cat("Number of transactions:", n_trans, "\n")
## Number of transactions: 9835
cat("Number of items:", n_items, "\n")
## Number of items: 169
cat("Total item occurrences:", n_nonzero, "\n")
## Total item occurrences: 43367
cat("Density (non-zero / total possible):", round(dens, 4), "\n")
## Density (non-zero / total possible): 0.0261
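As a cross-check, the same totals can be recovered from the basket-size distribution printed by summary(Groceries) above, without loading arules. The sketch below is base R only. The vectors sizes and counts are typed in by hand from that printed table, and all variable names are illustrative rather than part of the report's pipeline.

```r
# Basket sizes and the number of baskets of each size (copied from the summary output).
sizes  <- c(1:24, 26, 27, 28, 29, 32)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350, 246, 182, 117,
            78, 77, 55, 46, 29, 14, 14, 9, 11, 4, 6, 1, 1, 1, 1, 3, 1)
stopifnot(length(sizes) == length(counts))

n_trans_check   <- sum(counts)           # number of baskets
n_nonzero_check <- sum(sizes * counts)   # total item occurrences
n_items_check   <- 169                   # distinct items, from the printout

mean_size  <- n_nonzero_check / n_trans_check
dens_check <- n_nonzero_check / (n_trans_check * n_items_check)

cat("Transactions:", n_trans_check, "\n")
cat("Item occurrences:", n_nonzero_check, "\n")
cat("Mean basket size:", round(mean_size, 2), "\n")
cat("Density:", round(dens_check, 4), "\n")
```

The results agree with the values reported above: 9835 transactions, 43367 item occurrences, a mean basket size of 4.41 and a density of 0.0261.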
item_freq <- itemFrequency(Groceries, type = "relative")
item_freq_df <- data.frame(
  item = names(item_freq),
  frequency = as.numeric(item_freq)
) %>%
  arrange(desc(frequency))
head(item_freq_df, 15)
top_n <- 20
ggplot(item_freq_df[1:top_n, ], aes(x = reorder(item, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(
    title = paste("Top", top_n, "items by relative frequency"),
    x = "",
    y = "Relative frequency"
  )
Thresholds are a trade-off:
- too strict → too few rules,
- too loose → too many noisy rules.

We start with a baseline parameter setting designed to yield a readable number of rules:
- min support = 0.01 (≥ 1% of baskets),
- min confidence = 0.30,
- rule length between 2 and 4 items total.
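For reference, the three measures used throughout can be written as follows. These are the standard textbook definitions; the notation is introduced here and does not come from the report: N is the number of transactions and count(X) is the number of transactions that contain every item in X.

```latex
\begin{aligned}
\operatorname{supp}(X) &= \frac{\operatorname{count}(X)}{N}, \\
\operatorname{supp}(X \Rightarrow Y) &= \operatorname{supp}(X \cup Y), \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}, \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}.
\end{aligned}
```

A lift above 1 means that X and Y appear together more often than they would if they were independent. The coverage column in the rule tables is supp(X), the support of the left-hand side.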
rules_base <- apriori(
  Groceries,
  parameter = list(
    supp = 0.01,
    conf = 0.30,
    minlen = 2,
    maxlen = 4,
    target = "rules"
  )
)
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##       4  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4
## done [0.00s].
## writing ... [125 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_base
## set of 125 rules
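The "Absolute minimum support count" in the log is the relative threshold expressed as a number of baskets. A minimal base-R sketch follows, assuming the count is the support threshold times the number of transactions, rounded down. That rounding rule is an inference from the values logged in this report (98 for the baseline, 196 and 49 in the sensitivity runs), not something taken from the arules documentation.

```r
n_trans_log <- 9835   # transactions, from the Apriori log
min_supp    <- c(loose = 0.005, baseline = 0.01, strict = 0.02)

# Relative support threshold expressed as a number of baskets (assumed: rounded down).
min_count <- floor(min_supp * n_trans_log)
print(min_count)
```

Because a basket count is a whole number, a rule needs at least 99 baskets to reach a support of 0.01 (99 / 9835 ≈ 0.0101), which matches the smallest count in the rule table.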
rules_top <- sort(rules_base, by = "lift", decreasing = TRUE)
inspect(head(rules_top, 15))
##      lhs                                       rhs                support confidence coverage lift count
## [1]  {citrus fruit, other vegetables}       => {root vegetables}  0.0104  0.359      0.0289   3.30 102
## [2]  {tropical fruit, other vegetables}     => {root vegetables}  0.0123  0.343      0.0359   3.14 121
## [3]  {beef}                                 => {root vegetables}  0.0174  0.331      0.0525   3.04 171
## [4]  {citrus fruit, root vegetables}        => {other vegetables} 0.0104  0.586      0.0177   3.03 102
## [5]  {tropical fruit, root vegetables}      => {other vegetables} 0.0123  0.585      0.0210   3.02 121
## [6]  {other vegetables, whole milk}         => {root vegetables}  0.0232  0.310      0.0748   2.84 228
## [7]  {whole milk, curd}                     => {yogurt}           0.0101  0.385      0.0261   2.76 99
## [8]  {root vegetables, rolls/buns}          => {other vegetables} 0.0122  0.502      0.0243   2.59 120
## [9]  {root vegetables, yogurt}              => {other vegetables} 0.0129  0.500      0.0258   2.58 127
## [10] {tropical fruit, whole milk}           => {yogurt}           0.0151  0.358      0.0423   2.57 149
## [11] {yogurt, whipped/sour cream}           => {other vegetables} 0.0102  0.490      0.0207   2.53 100
## [12] {other vegetables, whipped/sour cream} => {yogurt}           0.0102  0.352      0.0289   2.52 100
## [13] {tropical fruit, other vegetables}     => {yogurt}           0.0123  0.343      0.0359   2.46 121
## [14] {root vegetables, whole milk}          => {other vegetables} 0.0232  0.474      0.0489   2.45 228
## [15] {whole milk, whipped/sour cream}       => {yogurt}           0.0109  0.338      0.0322   2.42 107
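The columns of this table are linked by simple identities: support = count / N and confidence = support / coverage, where coverage is the support of the left-hand side. The sketch below re-derives both for rules [1], [2], [3] and [6] in base R. The numbers are typed in from the printed output, so small rounding differences are expected, and the data frame name top_rules is illustrative.

```r
n_trans_tab <- 9835   # number of transactions

# Values copied by hand from the printed rule table above.
top_rules <- data.frame(
  rule       = c("[1]", "[2]", "[3]", "[6]"),
  support    = c(0.0104, 0.0123, 0.0174, 0.0232),
  confidence = c(0.359, 0.343, 0.331, 0.310),
  coverage   = c(0.0289, 0.0359, 0.0525, 0.0748),
  count      = c(102, 121, 171, 228)
)

# support = count / N
top_rules$support_check <- top_rules$count / n_trans_tab
# confidence = supp(LHS and RHS) / supp(LHS) = support / coverage
top_rules$confidence_check <- top_rules$support_check / top_rules$coverage

print(round(top_rules[, c("support", "support_check", "confidence", "confidence_check")], 4))
```

The recomputed values should agree with the printed ones to within the rounding of the table.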
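A rule is redundant when a more general rule (same RHS, with a subset of the LHS items) has the same or higher confidence, so the longer rule adds no predictive information. We remove such rules to obtain a cleaner and more interpretable set.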
rules_nr <- rules_base[!is.redundant(rules_base)]
cat("Rules before:", length(rules_base), "\n")
## Rules before: 125
cat("Rules non-redundant:", length(rules_nr), "\n")
## Rules non-redundant: 123
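To make the criterion concrete: is.redundant() flags a rule when a more general rule (same RHS, LHS a proper subset) has the same or higher confidence. The toy sketch below applies that test in base R to three hypothetical rules that are not taken from Groceries. It is a simplified illustration of the criterion, not the arules implementation, and the helper names are made up for this example.

```r
# Hypothetical rules: LHS item set, single RHS item, confidence.
toy_rules <- list(
  list(lhs = c("a"),      rhs = "c", confidence = 0.60),
  list(lhs = c("a", "b"), rhs = "c", confidence = 0.55),  # no better than {a} => {c}
  list(lhs = c("a", "d"), rhs = "c", confidence = 0.80)   # improves on {a} => {c}
)

# TRUE if every element of x is in y and x is strictly smaller.
is_proper_subset <- function(x, y) all(x %in% y) && length(x) < length(y)

# A rule is flagged if some other rule is more general and at least as confident.
is_redundant_toy <- vapply(seq_along(toy_rules), function(i) {
  r <- toy_rules[[i]]
  any(vapply(toy_rules[-i], function(g) {
    identical(g$rhs, r$rhs) &&
      is_proper_subset(g$lhs, r$lhs) &&
      g$confidence >= r$confidence
  }, logical(1)))
}, logical(1))

print(is_redundant_toy)
```

Only the second toy rule is flagged: adding b to the LHS lowers confidence relative to {a} => {c}, whereas adding d raises it.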
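whole milk is the most frequent item, appearing in 2,513 of the 9,835 baskets (about 25.6%). Rules
predicting it can support cross-selling ideas, but its high base rate inflates confidence, so we rank these rules by lift.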
rules_milk <- subset(rules_nr, rhs %in% "whole milk")
rules_milk <- sort(rules_milk, by = "lift", decreasing = TRUE)
if (length(rules_milk) == 0) {
  cat("No rules found with RHS = 'whole milk' under the current thresholds.\n")
  cat("Try lowering support/confidence slightly in the Apriori parameters.\n")
} else {
  # Show the 15 rules with the highest lift
  inspect(head(rules_milk, 15))
}
## lhs rhs support confidence coverage lift count
## [1] {curd,
## yogurt} => {whole milk} 0.0101 0.582 0.0173 2.28 99
## [2] {other vegetables,
## butter} => {whole milk} 0.0115 0.574 0.0200 2.24 113
## [3] {tropical fruit,
## root vegetables} => {whole milk} 0.0120 0.570 0.0210 2.23 118
## [4] {root vegetables,
## yogurt} => {whole milk} 0.0145 0.563 0.0258 2.20 143
## [5] {other vegetables,
## domestic eggs} => {whole milk} 0.0123 0.553 0.0223 2.16 121
## [6] {yogurt,
## whipped/sour cream} => {whole milk} 0.0109 0.525 0.0207 2.05 107
## [7] {root vegetables,
## rolls/buns} => {whole milk} 0.0127 0.523 0.0243 2.05 125
## [8] {pip fruit,
## other vegetables} => {whole milk} 0.0135 0.518 0.0261 2.03 133
## [9] {tropical fruit,
## yogurt} => {whole milk} 0.0151 0.517 0.0293 2.02 149
## [10] {other vegetables,
## yogurt} => {whole milk} 0.0223 0.513 0.0434 2.01 219
## [11] {other vegetables,
## whipped/sour cream} => {whole milk} 0.0146 0.507 0.0289 1.98 144
## [12] {other vegetables,
## fruit/vegetable juice} => {whole milk} 0.0105 0.498 0.0210 1.95 103
## [13] {butter} => {whole milk} 0.0276 0.497 0.0554 1.95 271
## [14] {curd} => {whole milk} 0.0261 0.490 0.0533 1.92 257
## [15] {root vegetables,
## other vegetables} => {whole milk} 0.0232 0.489 0.0474 1.91 228
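Since every rule here has the same RHS, lift is simply confidence divided by the share of all baskets that contain whole milk (2513 of 9835, from the summary output above). A quick base-R check on four rows follows, with confidence and lift typed in from the printed table, so agreement is only up to rounding. The names baseline_milk and milk_rules are illustrative.

```r
baseline_milk <- 2513 / 9835   # share of baskets containing whole milk

# Values copied by hand from rows [1], [3], [13] and [14] of the table above.
milk_rules <- data.frame(
  lhs          = c("{curd, yogurt}", "{tropical fruit, root vegetables}",
                   "{butter}", "{curd}"),
  confidence   = c(0.582, 0.570, 0.497, 0.490),
  lift_printed = c(2.28, 2.23, 1.95, 1.92)
)

# lift = confidence / supp(RHS)
milk_rules$lift_check <- milk_rules$confidence / baseline_milk

print(round(baseline_milk, 4))
print(milk_rules)
```

Read this way, a lift of about 2 means that seeing the LHS roughly doubles the chance of whole milk relative to its 25.6% baseline.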
q_df <- as.data.frame(quality(rules_nr))
ggplot(q_df, aes(x = support, y = confidence)) +
  geom_point(alpha = 0.5) +
  labs(title = "Rule quality: Support vs Confidence", x = "Support", y = "Confidence")
ggplot(q_df, aes(x = support, y = lift)) +
  geom_point(alpha = 0.5) +
  labs(title = "Rule quality: Support vs Lift", x = "Support", y = "Lift")
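To demonstrate the impact of parameter choice, we mine rules under two alternative settings: a stricter one (support ≥ 0.02, confidence ≥ 0.40) and a looser one (support ≥ 0.005, confidence ≥ 0.25).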
rules_strict <- apriori(
  Groceries,
  parameter = list(supp = 0.02, conf = 0.40, minlen = 2, maxlen = 4, target = "rules")
)
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5    0.02      2
##  maxlen target  ext
##       4  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 196
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_loose <- apriori(
  Groceries,
  parameter = list(supp = 0.005, conf = 0.25, minlen = 2, maxlen = 4, target = "rules")
)
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##       4  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 49
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4
## done [0.00s].
## writing ... [662 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
data.frame(
  setting = c(
    "baseline (supp=0.01, conf=0.30)",
    "strict (supp=0.02, conf=0.40)",
    "loose (supp=0.005, conf=0.25)"
  ),
  n_rules = c(length(rules_base), length(rules_strict), length(rules_loose))
)
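The Apriori logs above report 15 rules for the strict setting, 125 for the baseline and 662 for the loose one. The mechanism behind that pattern can be sketched in base R on a tiny example: because both thresholds are lower bounds, loosening them can only keep or add rules. The baskets, thresholds and helper names below are hypothetical and chosen for illustration; only single-item rules {a} => {b} are enumerated, which is far simpler than what apriori() does.

```r
# Hypothetical toy baskets (not the Groceries data).
baskets <- list(
  c("milk", "bread"), c("milk", "bread", "butter"), c("milk", "butter"),
  c("bread"), c("milk", "bread", "eggs"), c("eggs"), c("milk"),
  c("bread", "butter")
)
items <- sort(unique(unlist(baskets)))

# Share of baskets that contain every item in `set`.
supp <- function(set) {
  mean(vapply(baskets, function(b) all(set %in% b), logical(1)))
}

# Enumerate all candidate rules {a} => {b} with a != b.
cand <- expand.grid(lhs = items, rhs = items, stringsAsFactors = FALSE)
cand <- cand[cand$lhs != cand$rhs, ]
cand$support    <- unname(mapply(function(a, b) supp(c(a, b)), cand$lhs, cand$rhs))
cand$confidence <- cand$support / unname(vapply(cand$lhs, supp, numeric(1)))

# Number of candidate rules that pass both thresholds.
count_rules <- function(min_supp, min_conf) {
  sum(cand$support >= min_supp & cand$confidence >= min_conf)
}

settings <- data.frame(
  setting  = c("strict", "baseline", "loose"),
  min_supp = c(0.50, 0.25, 0.125),
  min_conf = c(0.60, 0.50, 0.30)
)
settings$n_rules <- unname(mapply(count_rules, settings$min_supp, settings$min_conf))
print(settings)
```

The rule count never decreases from the strict row to the loose row, mirroring the trade-off seen on Groceries.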
If arulesViz is installed, we add an additional
scatterplot visualization. If it is not installed, this section is
skipped and the report still knits successfully.
if (has_arulesviz) {
  rules_plot <- head(sort(rules_nr, by = "lift", decreasing = TRUE), 50)
  plot(rules_plot, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")
}
Rules that predict a frequent item (e.g., whole milk) can
be useful for cross-selling, but may require balancing lift with
adequate support.
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.utf8
##
## time zone: Europe/Warsaw
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_1.3.2 dplyr_1.1.4 ggplot2_4.0.1 arules_1.7-11 Matrix_1.7-3
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.5 knitr_1.50 rlang_1.1.6
## [5] xfun_0.54 purrr_1.1.0 generics_0.1.4 S7_0.2.0
## [9] jsonlite_2.0.0 labeling_0.4.3 glue_1.8.0 htmltools_0.5.8.1
## [13] sass_0.4.10 scales_1.4.0 rmarkdown_2.30 grid_4.5.1
## [17] tibble_3.3.0 evaluate_1.0.5 jquerylib_0.1.4 fastmap_1.2.0
## [21] yaml_2.3.10 lifecycle_1.0.4 compiler_4.5.1 codetools_0.2-20
## [25] RColorBrewer_1.1-3 pkgconfig_2.0.3 rstudioapi_0.17.1 farver_2.1.2
## [29] lattice_0.22-7 digest_0.6.37 R6_2.6.1 tidyselect_1.2.1
## [33] pillar_1.11.1 magrittr_2.0.4 bslib_0.9.0 withr_3.0.2
## [37] tools_4.5.1 gtable_0.3.6 cachem_1.1.0