1. Introduction

1.1 Motivation

Association rule mining (market-basket analysis) discovers co-occurrence patterns in transactional data. Typical business applications include:

- product bundling and cross-selling,
- store layout optimization,
- recommendation systems,
- identifying complementary vs. substitute items.

1.2 Research question

Which item combinations and association rules are most informative in the Groceries transaction dataset, and how do support/confidence/lift trade off under different threshold choices?

2. Data

2.1 Dataset: Groceries transactions (from `arules`)

We use the classic Groceries dataset from the arules package:

- 9,835 transactions (shopping baskets),
- 169 unique items.

This dataset is widely used for teaching association rules because it is large enough to be realistic but still manageable.

# This report relies on a few CRAN packages. If they are missing, we install them once.
required_pkgs <- c("arules", "ggplot2", "dplyr", "tidyr")

# Named missing_pkgs so the base function missing() is not shadowed
missing_pkgs <- required_pkgs[!vapply(required_pkgs, requireNamespace, quietly = TRUE, FUN.VALUE = logical(1))]
if (length(missing_pkgs) > 0) {
  message("Installing missing packages: ", paste(missing_pkgs, collapse = ", "))
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}

# Load libraries (fail fast if installation was blocked)
invisible(lapply(required_pkgs, library, character.only = TRUE))

# Optional: nicer visualizations (not required). We only use it if installed.
has_arulesviz <- requireNamespace("arulesViz", quietly = TRUE)
if (has_arulesviz) library(arulesViz)

data("Groceries")  # loads an object named Groceries
Groceries

## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)

2.2 Quick data overview

summary(Groceries)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.0261 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    3.00    4.41    6.00   32.00 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

n_trans <- length(Groceries)
n_items <- length(itemLabels(Groceries))

# Total number of item occurrences across all transactions
n_nonzero <- sum(size(Groceries))

# Transaction matrix density = nonzero / (n_trans * n_items)
dens <- n_nonzero / (n_trans * n_items)

cat("Number of transactions:", n_trans, "\n")

## Number of transactions: 9835

cat("Number of items:", n_items, "\n")

## Number of items: 169

cat("Total item occurrences:", n_nonzero, "\n")

## Total item occurrences: 43367

cat("Density (non-zero / total possible):", round(dens, 4), "\n")

## Density (non-zero / total possible): 0.0261

3. Exploratory analysis

3.1 Item frequency (top items)

item_freq <- itemFrequency(Groceries, type = "relative")

item_freq_df <- data.frame(
  item = names(item_freq),
  frequency = as.numeric(item_freq)
) %>%
  arrange(desc(frequency))

head(item_freq_df, 15)

top_n <- 20
ggplot(item_freq_df[1:top_n, ], aes(x = reorder(item, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(
    title = paste("Top", top_n, "items by relative frequency"),
    x = "",
    y = "Relative frequency"
  )

3.2 Why thresholds matter

- Support measures prevalence: the fraction of transactions that contain an itemset.
- Confidence measures conditional probability: P(RHS | LHS), i.e. support(LHS ∪ RHS) / support(LHS).
- Lift measures association strength beyond chance: confidence / support(RHS). Values > 1 indicate positive association, values < 1 negative association.

Thresholds are a trade-off:

- too strict → too few rules,
- too loose → too many noisy rules.
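
To make the three measures concrete, here is a minimal base-R sketch on a toy example. The five baskets and the helper `support_of()` are hypothetical and exist only for illustration; they are not taken from Groceries or from the arules API.

```r
# Hypothetical toy baskets (invented for illustration, not from Groceries).
baskets <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("milk")
)

# Fraction of baskets that contain every item in `items`.
support_of <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "bread"
rhs <- "milk"

supp_rule <- support_of(c(lhs, rhs), baskets)       # share of baskets with LHS and RHS
conf_rule <- supp_rule / support_of(lhs, baskets)   # P(RHS | LHS)
lift_rule <- conf_rule / support_of(rhs, baskets)   # confidence relative to P(RHS)

cat("support:", supp_rule, "confidence:", conf_rule, "lift:", lift_rule, "\n")
```

In this toy data the rule {bread} => {milk} has a confidence of about 0.67 but a lift below 1, because milk is in 80% of all baskets anyway. The same effect is why confidence alone can flatter rules that predict very frequent items.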

4. Mining frequent itemsets and rules (Apriori)

We start with a baseline parameter setting designed to yield a readable number of rules:

- min support = 0.01 (≥ 1% of baskets),
- min confidence = 0.30,
- rule length between 2 and 4 items total.

rules_base <- apriori(
  Groceries,
  parameter = list(
    supp = 0.01,
    conf = 0.30,
    minlen = 2,
    maxlen = 4,
    target = "rules"
  )
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##       4  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [125 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_base

## set of 125 rules

4.1 Inspect top rules by lift

rules_top <- sort(rules_base, by = "lift", decreasing = TRUE)
inspect(head(rules_top, 15))

##      lhs                                       rhs                support
## [1]  {citrus fruit, other vegetables}       => {root vegetables}  0.0104 
## [2]  {tropical fruit, other vegetables}     => {root vegetables}  0.0123 
## [3]  {beef}                                 => {root vegetables}  0.0174 
## [4]  {citrus fruit, root vegetables}        => {other vegetables} 0.0104 
## [5]  {tropical fruit, root vegetables}      => {other vegetables} 0.0123 
## [6]  {other vegetables, whole milk}         => {root vegetables}  0.0232 
## [7]  {whole milk, curd}                     => {yogurt}           0.0101 
## [8]  {root vegetables, rolls/buns}          => {other vegetables} 0.0122 
## [9]  {root vegetables, yogurt}              => {other vegetables} 0.0129 
## [10] {tropical fruit, whole milk}           => {yogurt}           0.0151 
## [11] {yogurt, whipped/sour cream}           => {other vegetables} 0.0102 
## [12] {other vegetables, whipped/sour cream} => {yogurt}           0.0102 
## [13] {tropical fruit, other vegetables}     => {yogurt}           0.0123 
## [14] {root vegetables, whole milk}          => {other vegetables} 0.0232 
## [15] {whole milk, whipped/sour cream}       => {yogurt}           0.0109 
##      confidence coverage lift count
## [1]  0.359      0.0289   3.30 102  
## [2]  0.343      0.0359   3.14 121  
## [3]  0.331      0.0525   3.04 171  
## [4]  0.586      0.0177   3.03 102  
## [5]  0.585      0.0210   3.02 121  
## [6]  0.310      0.0748   2.84 228  
## [7]  0.385      0.0261   2.76  99  
## [8]  0.502      0.0243   2.59 120  
## [9]  0.500      0.0258   2.58 127  
## [10] 0.358      0.0423   2.57 149  
## [11] 0.490      0.0207   2.53 100  
## [12] 0.352      0.0289   2.52 100  
## [13] 0.343      0.0359   2.46 121  
## [14] 0.474      0.0489   2.45 228  
## [15] 0.338      0.0322   2.42 107

5. Rule diagnostics and filtering

5.1 Remove redundant rules

Redundant rules provide no additional information beyond more general rules. We remove them to obtain a cleaner and more interpretable set.

rules_nr <- rules_base[!is.redundant(rules_base)]

cat("Rules before:", length(rules_base), "\n")

## Rules before: 125

cat("Rules non-redundant:", length(rules_nr), "\n")

## Rules non-redundant: 123
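
As a rough illustration of the idea behind redundancy filtering, the sketch below flags a rule when a more general rule (a proper subset of its LHS, with the same RHS) has equal or higher confidence. This reflects our understanding of the default criterion, but it is a simplified stand-in and not the arules implementation. The rule table, its confidence values, and the helpers `split_items()` and `is_redundant_sketch()` are hypothetical.

```r
# Hypothetical rule table (values invented for illustration only).
rules_toy <- data.frame(
  lhs = c("a", "a,b", "a,c"),
  rhs = c("x", "x", "x"),
  confidence = c(0.60, 0.55, 0.70),
  stringsAsFactors = FALSE
)

# Split a comma-separated LHS string into its items.
split_items <- function(s) strsplit(s, ",", fixed = TRUE)[[1]]

# TRUE for rule i if some other rule j has the same RHS, an LHS that is a
# proper subset of rule i's LHS, and confidence at least as high.
is_redundant_sketch <- function(rules) {
  vapply(seq_len(nrow(rules)), function(i) {
    lhs_i <- split_items(rules$lhs[i])
    any(vapply(seq_len(nrow(rules)), function(j) {
      lhs_j <- split_items(rules$lhs[j])
      j != i &&
        rules$rhs[j] == rules$rhs[i] &&
        length(lhs_j) < length(lhs_i) &&
        all(lhs_j %in% lhs_i) &&
        rules$confidence[j] >= rules$confidence[i]
    }, logical(1)))
  }, logical(1))
}

redundant <- is_redundant_sketch(rules_toy)
print(cbind(rules_toy, redundant))
```

Here only {a, b} => {x} is flagged: adding b to the LHS lowers confidence relative to {a} => {x}, so the longer rule adds nothing. {a, c} => {x} is kept because the extra item raises confidence.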

5.2 Focus on a business-relevant RHS (example: whole milk)

Whole milk is the most frequent item (2,513 of 9,835 baskets, about 25.6%). Rules predicting it can support cross-selling ideas.

rules_milk <- subset(rules_nr, rhs %in% "whole milk")
rules_milk <- sort(rules_milk, by = "lift", decreasing = TRUE)

if (length(rules_milk) == 0) {
  cat("No rules found with RHS = 'whole milk' under the current thresholds.\n")
  cat("Try lowering support/confidence slightly in the Apriori parameters.\n")
} else {
  # A bare `rules_milk` inside braces is not auto-printed, so we inspect directly
  inspect(head(rules_milk, 15))
}

##      lhs                        rhs          support confidence coverage lift count
## [1]  {curd,                                                                        
##       yogurt}                => {whole milk}  0.0101      0.582   0.0173 2.28    99
## [2]  {other vegetables,                                                            
##       butter}                => {whole milk}  0.0115      0.574   0.0200 2.24   113
## [3]  {tropical fruit,                                                              
##       root vegetables}       => {whole milk}  0.0120      0.570   0.0210 2.23   118
## [4]  {root vegetables,                                                             
##       yogurt}                => {whole milk}  0.0145      0.563   0.0258 2.20   143
## [5]  {other vegetables,                                                            
##       domestic eggs}         => {whole milk}  0.0123      0.553   0.0223 2.16   121
## [6]  {yogurt,                                                                      
##       whipped/sour cream}    => {whole milk}  0.0109      0.525   0.0207 2.05   107
## [7]  {root vegetables,                                                             
##       rolls/buns}            => {whole milk}  0.0127      0.523   0.0243 2.05   125
## [8]  {pip fruit,                                                                   
##       other vegetables}      => {whole milk}  0.0135      0.518   0.0261 2.03   133
## [9]  {tropical fruit,                                                              
##       yogurt}                => {whole milk}  0.0151      0.517   0.0293 2.02   149
## [10] {other vegetables,                                                            
##       yogurt}                => {whole milk}  0.0223      0.513   0.0434 2.01   219
## [11] {other vegetables,                                                            
##       whipped/sour cream}    => {whole milk}  0.0146      0.507   0.0289 1.98   144
## [12] {other vegetables,                                                            
##       fruit/vegetable juice} => {whole milk}  0.0105      0.498   0.0210 1.95   103
## [13] {butter}                => {whole milk}  0.0276      0.497   0.0554 1.95   271
## [14] {curd}                  => {whole milk}  0.0261      0.490   0.0533 1.92   257
## [15] {root vegetables,                                                             
##       other vegetables}      => {whole milk}  0.0232      0.489   0.0474 1.91   228
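
The lift and support of rule [1] can be reproduced from numbers already printed in this report: the whole milk count from summary(Groceries), the number of transactions, and the rule's confidence and count from the table above. The variable names below are ours and are chosen so that nothing from the analysis is overwritten.

```r
# Inputs copied from output shown earlier in this report.
n_baskets  <- 9835   # number of transactions
n_milk     <- 2513   # baskets containing whole milk (summary output)
conf_rule1 <- 0.582  # confidence of {curd, yogurt} => {whole milk}
count_rule1 <- 99    # baskets containing curd, yogurt and whole milk

supp_rhs   <- n_milk / n_baskets        # base rate of whole milk
lift_rule1 <- conf_rule1 / supp_rhs     # lift = confidence / support(RHS)
supp_rule1 <- count_rule1 / n_baskets   # support = count / transactions

cat("support(whole milk):", round(supp_rhs, 4), "\n")
cat("lift:", round(lift_rule1, 2), "\n")
cat("support:", round(supp_rule1, 4), "\n")
```

The recomputed values agree with the table (lift 2.28, support 0.0101). Because whole milk has a base rate of roughly 0.26, a confidence of 0.58 corresponds to a lift of only a little over 2.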

5.3 Visualizing support, confidence, and lift

q_df <- as.data.frame(quality(rules_nr))

ggplot(q_df, aes(x = support, y = confidence)) +
  geom_point(alpha = 0.5) +
  labs(title = "Rule quality: Support vs Confidence", x = "Support", y = "Confidence")

ggplot(q_df, aes(x = support, y = lift)) +
  geom_point(alpha = 0.5) +
  labs(title = "Rule quality: Support vs Lift", x = "Support", y = "Lift")

6. Sensitivity analysis: changing thresholds

To demonstrate the impact of parameter choice, we mine rules under two alternative settings:

- Stricter: higher support and confidence (supp = 0.02, conf = 0.40)
- Looser: lower support and confidence (supp = 0.005, conf = 0.25)

rules_strict <- apriori(
  Groceries,
  parameter = list(supp = 0.02, conf = 0.40, minlen = 2, maxlen = 4, target = "rules")
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5    0.02      2
##  maxlen target  ext
##       4  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 196 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_loose <- apriori(
  Groceries,
  parameter = list(supp = 0.005, conf = 0.25, minlen = 2, maxlen = 4, target = "rules")
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##       4  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 49 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [662 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

data.frame(
  setting = c("baseline (supp=0.01, conf=0.30)", "strict (supp=0.02, conf=0.40)", "loose (supp=0.005, conf=0.25)"),
  n_rules = c(length(rules_base), length(rules_strict), length(rules_loose))
)
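
The "Absolute minimum support count" lines in the three Apriori logs follow from simple arithmetic on the relative thresholds. The sketch below assumes the count is the relative support times the number of transactions, rounded down; this matches the logged values (98, 196, 49) but is an inference from the output, not a statement about arules internals.

```r
n_baskets <- 9835  # number of transactions in Groceries

# Relative minimum support for each setting used in this report.
min_support <- c(baseline = 0.01, strict = 0.02, loose = 0.005)

# Assumed conversion to an absolute basket count (rounding down).
min_count <- floor(min_support * n_baskets)

print(data.frame(min_support, min_count))
```

Halving the support threshold from 0.01 to 0.005 lowers the bar from 98 to 49 baskets, which is why the loose setting admits many more rules (662 versus 125).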

7. Optional visualization with arulesViz (only if installed)

If arulesViz is installed, we add a scatterplot of the 50 highest-lift rules. If it is not installed, this section is skipped and the report still knits successfully.

rules_plot <- head(sort(rules_nr, by = "lift", decreasing = TRUE), 50)
# Draw the plot only when arulesViz is available (flag set during package setup)
if (has_arulesviz) plot(rules_plot, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")

8. Discussion

8.1 Key findings

- The highest-lift rules sit close to the support threshold (for example, the top rule has lift 3.30 at support 0.0104), so they apply to a small niche of baskets.
- Rules predicting frequent items (e.g., whole milk) can be useful for cross-selling, but lift has to be balanced against adequate support.
- Removing redundant rules trimmed the baseline set only slightly (125 to 123 rules), so its interpretability benefit is modest at these thresholds.

8.2 Limitations

- Association rules do not imply causality; they capture co-occurrence.
- The dataset lacks time, price, and customer identity, which limits personalization.
- Threshold choice strongly shapes results; sensitivity checks are essential.

8.3 Next steps

- With data that includes customer IDs, mine segment-specific rules (e.g., by cluster).
- Explore sequential pattern mining if transaction order is recorded and matters.
- Combine rules with profit margins for business-optimized recommendations.

9. Reproducibility

sessionInfo()

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C                               
## [5] LC_TIME=Chinese (Simplified)_China.utf8    
## 
## time zone: Europe/Warsaw
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_1.3.2   dplyr_1.1.4   ggplot2_4.0.1 arules_1.7-11 Matrix_1.7-3 
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5        cli_3.6.5          knitr_1.50         rlang_1.1.6       
##  [5] xfun_0.54          purrr_1.1.0        generics_0.1.4     S7_0.2.0          
##  [9] jsonlite_2.0.0     labeling_0.4.3     glue_1.8.0         htmltools_0.5.8.1 
## [13] sass_0.4.10        scales_1.4.0       rmarkdown_2.30     grid_4.5.1        
## [17] tibble_3.3.0       evaluate_1.0.5     jquerylib_0.1.4    fastmap_1.2.0     
## [21] yaml_2.3.10        lifecycle_1.0.4    compiler_4.5.1     codetools_0.2-20  
## [25] RColorBrewer_1.1-3 pkgconfig_2.0.3    rstudioapi_0.17.1  farver_2.1.2      
## [29] lattice_0.22-7     digest_0.6.37      R6_2.6.1           tidyselect_1.2.1  
## [33] pillar_1.11.1      magrittr_2.0.4     bslib_0.9.0        withr_3.0.2       
## [37] tools_4.5.1        gtable_0.3.6       cachem_1.1.0

Unsupervised Learning — Project 3 (Association Rules / Basket Analysis)

Mining market-basket patterns with Apriori on the Groceries dataset

Rongfeng Qiu (Student ID: 488004)

2025-12-28
