1. Introduction

Market Basket Analysis uses association rule mining to discover relationships among items that co-occur in transactions. The classic setting is a grocery store: each receipt is a transaction, and the items on it form an itemset. We mine rules of the form

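\[A \Rightarrow B\]

where the antecedent \(A\) and the consequent \(B\) are disjoint itemsets, for example \(\{\text{bread}, \text{butter}\} \Rightarrow \{\text{milk}\}\). Each rule is evaluated with three key metrics. Here \(P(A)\) is the fraction of transactions that contain every item in \(A\), and \(P(A \cap B)\) is the fraction that contain every item in both \(A\) and \(B\) (the arules documentation writes the latter as the support of the combined itemset \(A \cup B\)):

Metric	Formula	Interpretation
Support	\(P(A \cap B)\)	How often the antecedent and consequent appear together
Confidence	\(P(B \mid A) = \frac{P(A \cap B)}{P(A)}\)	How often baskets containing \(A\) also contain \(B\)
Lift	\(\frac{P(A \cap B)}{P(A)\,P(B)}\)	How much more often \(A\) and \(B\) co-occur than if they were independent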

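A lift above 1 means the antecedent and consequent co-occur more often than they would if they were independent (a positive association); a lift of 1 indicates independence, and a lift below 1 a negative association. For rules supported by only a handful of transactions, a high lift can also arise by chance.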
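
As a small worked illustration of the three metrics, the sketch below computes them in base R for the rule {bread} => {butter}. The five baskets, the item names, and the helper toy_support are hypothetical and are not taken from the grocery data.

```r
# Hypothetical toy baskets (not from GroceryDataSet.csv)
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk"),
  c("butter", "milk")
)

# Fraction of baskets that contain every item in `items`
toy_support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

antecedent <- "bread"
consequent <- "butter"

supp_rule  <- toy_support(c(antecedent, consequent))   # P(A and B) = 2/5
confidence <- supp_rule / toy_support(antecedent)      # P(B | A)   = 0.4 / 0.6
lift       <- supp_rule / (toy_support(antecedent) * toy_support(consequent))

round(c(support = supp_rule, confidence = confidence, lift = lift), 3)
```

Here the rule has support 0.4, confidence about 0.667, and lift about 1.111, so bread and butter co-occur slightly more often than independence would predict.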

2. Data Loading & Exploration

library(tidyverse)
library(arules)
library(arulesViz)
library(knitr)
library(kableExtra)

# Read as transactions directly — each row is one basket
txns <- read.transactions(
  "GroceryDataSet.csv",
  format = "basket",
  sep    = ",",
  rm.duplicates = TRUE
)

txns

## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)

summary(txns)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

itemFrequencyPlot(
  txns,
  topN      = 20,
  type      = "relative",
  col       = colorRampPalette(c("#2c7bb6", "#abd9e9", "#fdae61", "#d7191c"))(20),
  main      = "Top 20 Items by Relative Frequency",
  ylab      = "Item Frequency (proportion of transactions)",
  cex.names = 0.75
)

Top 20 most frequent items across all transactions

Key observations:

Whole milk appears in ~25 % of all baskets — by far the most common item.
Other vegetables, rolls/buns, and soda round out the top items.
The long tail of infrequent items is typical of retail data and motivates using a minimum support threshold to keep the rule search tractable.
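
The concentration in a few items can be quantified from the summary(txns) output above. The sketch below only re-derives figures from the printed counts; the variable names are illustrative.

```r
# Counts copied from the summary(txns) output
top5_counts <- c("whole milk" = 2513, "other vegetables" = 1903,
                 "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)
other_count    <- 34055
n_transactions <- 9835
n_items        <- 169

# Total number of item occurrences across all baskets
total_occurrences <- sum(top5_counts) + other_count

top5_share  <- sum(top5_counts) / total_occurrences          # share held by 5 items
mean_basket <- total_occurrences / n_transactions            # mean basket size
density     <- total_occurrences / (n_transactions * n_items) # fill rate of the matrix

round(c(top5_share = top5_share, mean_basket = mean_basket, density = density), 4)
```

Five of the 169 items account for roughly 21 % of all item occurrences, and the recomputed mean basket size and density agree with the values reported by summary(txns).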

3. Association Rule Mining

3.1 Parameter Selection

We use the Apriori algorithm with:

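support = 0.001: an itemset must appear in at least 10 of the 9,835 transactions (0.001 × 9,835 ≈ 9.8)
confidence = 0.25: at least 25 % of the baskets containing the antecedent must also contain the consequent
minlen = 2: each rule has at least two items in total, which excludes rules with an empty antecedent
maxlen = 5: each rule has at most five items in total (antecedent plus consequent)

These thresholds trade rule quantity against meaningfulness: they are loose enough to surface rare itemsets, and the resulting 17,331 rules (see the output below) are then ranked by lift.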
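
The relative support threshold translates into an absolute basket count as sketched below. This shows the arithmetic only, with illustrative variable names.

```r
n_transactions <- 9835    # from the summary of txns
min_support    <- 0.001

# Count implied by the threshold, and the smallest whole count that meets it
threshold_count <- min_support * n_transactions   # about 9.835
min_count       <- ceiling(threshold_count)

c(threshold = threshold_count, min_count = min_count)
```

The result of 10 agrees with the minimum of the count column in summary(rules); the apriori log itself reports an absolute minimum support count of 9.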

rules <- apriori(
  txns,
  parameter = list(
    support    = 0.001,
    confidence = 0.25,
    minlen     = 2,
    maxlen     = 5
  )
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [17331 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules

## set of 17331 rules

summary(rules)

## set of 17331 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5 
##  367 6906 8371 1687 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.657   4.000   5.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001017   Min.   :0.2500   Min.   :0.001017   Min.   : 0.9784  
##  1st Qu.:0.001118   1st Qu.:0.3115   1st Qu.:0.002542   1st Qu.: 2.1534  
##  Median :0.001322   Median :0.4000   Median :0.003559   Median : 2.7955  
##  Mean   :0.001917   Mean   :0.4387   Mean   :0.004982   Mean   : 3.0773  
##  3rd Qu.:0.001932   3rd Qu.:0.5385   3rd Qu.:0.005186   3rd Qu.: 3.6563  
##  Max.   :0.074835   Max.   :1.0000   Max.   :0.255516   Max.   :35.7158  
##      count       
##  Min.   : 10.00  
##  1st Qu.: 11.00  
##  Median : 13.00  
##  Mean   : 18.85  
##  3rd Qu.: 19.00  
##  Max.   :736.00  
## 
## mining info:
##  data ntransactions support confidence
##  txns          9835   0.001       0.25
##                                                                                                call
##  apriori(data = txns, parameter = list(support = 0.001, confidence = 0.25, minlen = 2, maxlen = 5))

3.2 Top 10 Rules by Lift

top10 <- sort(rules, by = "lift", decreasing = TRUE)[1:10]

# Convert to a data frame, round the quality measures, and drop coverage
top10_df <- as(top10, "data.frame") %>%
  mutate(
    rules      = as.character(rules),
    support    = round(support,    4),
    confidence = round(confidence, 4),
    lift       = round(lift,       4)
  ) %>%
  select(rules, support, confidence, lift, count)

kable(
  top10_df,
  caption = "Top 10 Association Rules by Lift",
  col.names = c("Rule", "Support", "Confidence", "Lift", "Count"),
  align = "lcccc"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  column_spec(4, bold = TRUE, color = "white",
              background = spec_color(top10_df$lift, option = "D"))

Top 10 Association Rules by Lift
	Rule	Support	Confidence	Lift	Count
380	{bottled beer,red/blush wine} => {liquor}	0.0019	0.3958	35.7158	19
1147	{ham,white bread} => {processed cheese}	0.0019	0.3800	22.9282	19
379	{bottled beer,liquor} => {red/blush wine}	0.0019	0.4130	21.4936	19
442	{Instant food products,soda} => {hamburger meat}	0.0012	0.6316	18.9957	12
1585	{curd,sugar} => {flour}	0.0011	0.3235	18.6077	11
1497	{baking powder,sugar} => {flour}	0.0010	0.3125	17.9733	10
1146	{processed cheese,white bread} => {ham}	0.0019	0.4634	17.8034	19
1150	{fruit/vegetable juice,ham} => {processed cheese}	0.0011	0.2895	17.4661	11
1588	{margarine,sugar} => {flour}	0.0016	0.2963	17.0414	16
7495	{root vegetables,sugar,whole milk} => {flour}	0.0010	0.2941	16.9161	10

Reading the table:

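A lift of 10 or more means customers who bought the antecedent items are at least 10 times as likely to also buy the consequent as the average shopper. Each of these rules rests on only 10 to 19 transactions, however, so the estimates are noisy and should be validated before being acted on.
Support is low for the highest-lift rules because lift divides by the consequent's overall frequency: a rare consequent such as liquor (about 1.1 % of baskets) can produce a very large lift. Rare but tightly linked itemsets can still be valuable for targeted promotions.
Confidence is the share of baskets containing the antecedent that also contain the consequent, so it tells the retailer how often a cross-sell recommendation would have matched an actual purchase.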
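
The columns of the table are linked: lift equals confidence divided by the consequent's overall frequency. The sketch below checks this for two rules using the rounded values printed in this report (confidence and lift from the table above, item frequencies from the cluster profile table in Section 5.4), so it can only reproduce the reported lifts approximately.

```r
# Rounded values copied from the tables in this report
checks <- data.frame(
  rule       = c("{bottled beer,red/blush wine} => {liquor}",
                 "{bottled beer,liquor} => {red/blush wine}"),
  confidence = c(0.3958, 0.4130),
  rhs_freq   = c(0.0111, 0.0192),    # frequency of liquor, red/blush wine
  lift_table = c(35.7158, 21.4936)
)

# lift = P(B | A) / P(B)
checks$lift_recomputed <- checks$confidence / checks$rhs_freq
checks$rel_error <- abs(checks$lift_recomputed - checks$lift_table) / checks$lift_table

checks[, c("rule", "lift_recomputed", "lift_table", "rel_error")]
```

Both recomputed values fall within 1 % of the reported lifts; the remaining gap comes from rounding the inputs to four decimals.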

4. Rule Visualizations

4.1 Scatter Plot — Support vs. Confidence (color = Lift)

plot(
  rules,
  method  = "scatterplot",
  measure = c("support", "confidence"),
  shading = "lift",
  main    = "Association Rules: Support vs. Confidence (shaded by Lift)"
)

Each point is one rule; higher lift rules appear darker/warmer

4.2 Interactive Graph of Top 30 Rules by Lift

top30 <- sort(rules, by = "lift")[1:30]

plot(
  top30,
  method  = "graph",
  engine  = "htmlwidget"
)

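Items and rules are both drawn as nodes, with arrows running from the antecedent items into a rule node and on to its consequent; by default, rule-node size ∝ support and color ∝ lift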

4.3 Grouped Matrix Plot — Top 20 Rules

top20 <- sort(rules, by = "lift")[1:20]

# `main` is not among the control parameters of method = "grouped" with the
# ggplot2 engine, so the title is added to the returned ggplot object instead
plot(
  top20,
  method = "grouped"
) +
  ggtitle("Grouped Matrix of Top 20 Rules by Lift")

Antecedent groups (columns) × consequents (rows); balloon size = support, color = lift

4.4 Parallel Coordinates Plot

plot(
  top20,
  method = "paracoord",
  main   = "Parallel Coordinates — Top 20 Rules by Lift"
)

Each line traces one rule from antecedent(s) through to consequent

5. Extra Credit — Cluster Analysis

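We cluster the items (not the transactions) by how often they co-occur. Each item is represented by its binary column in the transaction matrix, and items are compared with the Jaccard distance, a standard choice for binary co-occurrence data because it ignores baskets that contain neither item. Section 5.2 applies Ward linkage to these distances; Ward's method formally assumes Euclidean distances, so its use with Jaccard distance is a heuristic.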
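
To make the distance concrete: for two items, the Jaccard distance is one minus the number of baskets containing both, divided by the number containing at least one. The sketch below checks base R's dist(method = "binary") against that definition. The 0/1 matrix and the helper jaccard_dist are hypothetical and not drawn from the grocery data.

```r
# Hypothetical item x transaction indicator matrix (rows = items)
m <- rbind(
  milk   = c(1, 1, 1, 0, 1, 0),
  yogurt = c(1, 1, 0, 0, 1, 0),
  liquor = c(0, 0, 0, 1, 0, 1)
)

# Jaccard distance computed by hand for one pair of items
jaccard_dist <- function(x, y) {
  both   <- sum(x == 1 & y == 1)
  either <- sum(x == 1 | y == 1)
  1 - both / either
}

d <- as.matrix(dist(m, method = "binary"))

by_hand <- jaccard_dist(m["milk", ], m["yogurt", ])   # 1 - 3/4 = 0.25
by_hand
d["milk", "yogurt"]    # same value from dist()
d["milk", "liquor"]    # 1: the two items never share a basket
```

Items that are often bought together end up close (milk and yogurt at 0.25), while items that never share a basket sit at the maximum distance of 1.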

5.1 Build Item Co-occurrence Matrix

# Keep only items appearing in ≥ 1 % of transactions for a cleaner cluster map
freq_items <- names(itemFrequency(txns)[itemFrequency(txns) >= 0.01])
txns_sub   <- txns[, freq_items]

# as(..., "matrix") returns a transactions × items logical matrix;
# transpose it so that dist() compares items (rows) rather than transactions
item_mat  <- as(txns_sub, "matrix")                  # transactions × items
item_dist <- dist(t(item_mat), method = "binary")    # Jaccard distance between items

5.2 Hierarchical Clustering

hc <- hclust(item_dist, method = "ward.D2")

plot(
  hc,
  main   = "Hierarchical Clustering of Grocery Items\n(Jaccard distance, Ward linkage)",
  xlab   = "",
  ylab   = "Height",
  cex    = 0.75,
  col    = "steelblue"
)

# Outline the 5-cluster cut on the dendrogram with colored rectangles
rect.hclust(hc, k = 5, border = c("#e41a1c","#377eb8","#4daf4a","#984ea3","#ff7f00"))

Ward-linkage dendrogram of frequent grocery items

5.3 Cluster Membership

clusters <- cutree(hc, k = 5)

cluster_df <- data.frame(
  Item    = names(clusters),
  Cluster = paste("Cluster", clusters)
) %>%
  arrange(Cluster, Item)

kable(
  cluster_df,
  caption = "Item Cluster Assignments (k = 5)",
  col.names = c("Item", "Cluster")
) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) %>%
  row_spec(which(cluster_df$Cluster == "Cluster 1"), background = "#fde0dc") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 2"), background = "#dceefb") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 3"), background = "#dcfbe5") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 4"), background = "#f3dcfb") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 5"), background = "#fbf3dc")

Item Cluster Assignments (k = 5)
Item	Cluster
UHT-milk	Cluster 1
baking powder	Cluster 1
beverages	Cluster 1
butter milk	Cluster 1
cake bar	Cluster 1
canned fish	Cluster 1
canned vegetables	Cluster 1
cat food	Cluster 1
chewing gum	Cluster 1
cling film/bags	Cluster 1
coffee	Cluster 1
condensed milk	Cluster 1
cream cheese	Cluster 1
curd	Cluster 1
dessert	Cluster 1
detergent	Cluster 1
dish cleaner	Cluster 1
dishes	Cluster 1
flour	Cluster 1
flower (seeds)	Cluster 1
frozen dessert	Cluster 1
frozen fish	Cluster 1
frozen meals	Cluster 1
grapes	Cluster 1
hamburger meat	Cluster 1
hard cheese	Cluster 1
herbs	Cluster 1
hygiene articles	Cluster 1
ice cream	Cluster 1
meat	Cluster 1
misc. beverages	Cluster 1
mustard	Cluster 1
napkins	Cluster 1
oil	Cluster 1
onions	Cluster 1
packaged fruit/vegetables	Cluster 1
pasta	Cluster 1
pickled vegetables	Cluster 1
pot plants	Cluster 1
roll products	Cluster 1
salt	Cluster 1
seasonal products	Cluster 1
semi-finished bread	Cluster 1
sliced cheese	Cluster 1
soft cheese	Cluster 1
spread cheese	Cluster 1
sugar	Cluster 1
white wine	Cluster 1
beef	Cluster 2
berries	Cluster 2
butter	Cluster 2
chicken	Cluster 2
citrus fruit	Cluster 2
domestic eggs	Cluster 2
frozen vegetables	Cluster 2
margarine	Cluster 2
other vegetables	Cluster 2
pip fruit	Cluster 2
pork	Cluster 2
root vegetables	Cluster 2
tropical fruit	Cluster 2
whipped/sour cream	Cluster 2
whole milk	Cluster 2
yogurt	Cluster 2
bottled beer	Cluster 3
bottled water	Cluster 3
brown bread	Cluster 3
canned beer	Cluster 3
frankfurter	Cluster 3
fruit/vegetable juice	Cluster 3
newspapers	Cluster 3
pastry	Cluster 3
rolls/buns	Cluster 3
sausage	Cluster 3
shopping bags	Cluster 3
soda	Cluster 3
candy	Cluster 4
chocolate	Cluster 4
ham	Cluster 4
long life bakery product	Cluster 4
processed cheese	Cluster 4
salty snack	Cluster 4
specialty bar	Cluster 4
specialty chocolate	Cluster 4
waffles	Cluster 4
white bread	Cluster 4
liquor	Cluster 5
red/blush wine	Cluster 5

5.4 Cluster Interpretation

# Frequency of each item within each cluster
freq_vec <- itemFrequency(txns_sub)

profile_df <- data.frame(
  Item      = names(freq_vec),
  Frequency = round(freq_vec, 4),
  Cluster   = paste("Cluster", clusters[names(freq_vec)])
) %>%
  group_by(Cluster) %>%
  arrange(desc(Frequency), .by_group = TRUE) %>%
  slice_head(n = 5)

kable(
  profile_df,
  caption = "Top 5 Items per Cluster by Transaction Frequency",
  col.names = c("Item", "Frequency", "Cluster")
) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Top 5 Items per Cluster by Transaction Frequency
Item	Frequency	Cluster
coffee	0.0581	Cluster 1
curd	0.0533	Cluster 1
napkins	0.0524	Cluster 1
cream cheese	0.0397	Cluster 1
dessert	0.0371	Cluster 1
whole milk	0.2555	Cluster 2
other vegetables	0.1935	Cluster 2
yogurt	0.1395	Cluster 2
root vegetables	0.1090	Cluster 2
tropical fruit	0.1049	Cluster 2
rolls/buns	0.1839	Cluster 3
soda	0.1744	Cluster 3
bottled water	0.1105	Cluster 3
shopping bags	0.0985	Cluster 3
sausage	0.0940	Cluster 3
chocolate	0.0496	Cluster 4
white bread	0.0421	Cluster 4
waffles	0.0384	Cluster 4
salty snack	0.0378	Cluster 4
long life bakery product	0.0374	Cluster 4
red/blush wine	0.0192	Cluster 5
liquor	0.0111	Cluster 5

Cluster interpretation:
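The dendrogram and cluster profiles point to product groupings. Cluster 2 gathers the high-frequency fresh staples (whole milk, vegetables, fruit, yogurt, meat, eggs). Cluster 3 is dominated by beverages and grab-and-go items (soda, bottled water, beer, rolls/buns, sausage, newspapers, shopping bags). Cluster 4 combines sweets and snacks with the ham, processed cheese, and white bread trio that also appears in the top-lift rules. Cluster 5 contains only liquor and red/blush wine, matching the highest-lift rule. Cluster 1, with 48 of the 88 items, is a catch-all of lower-frequency products (none above about 6 % of baskets) and should not be read as a coherent category. These groupings can guide store layout, shelf placement, and targeted coupon campaigns.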

6. Summary & Business Insights

Association Rules

High-lift pairs (lift > 5) represent the strongest cross-sell opportunities — products that are bought together far more than chance would predict.
Whole milk is a hub item appearing in many consequents; bundling promotions around it will reach a large share of shoppers.
Rules with higher confidence (> 0.5) are reliable enough to power recommendation engines at the point of sale.
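
As a sketch of how the lift and confidence thresholds combine, the snippet below applies them to the ten rules printed in Section 3.2, retyped as a plain data frame (top10_tbl, an illustrative name) so that it runs without the mined rules object. On the full rule set, the equivalent filter in arules would be subset(rules, lift > 5 & confidence > 0.5).

```r
# Values copied from the "Top 10 Association Rules by Lift" table
top10_tbl <- data.frame(
  rule = c("{bottled beer,red/blush wine} => {liquor}",
           "{ham,white bread} => {processed cheese}",
           "{bottled beer,liquor} => {red/blush wine}",
           "{Instant food products,soda} => {hamburger meat}",
           "{curd,sugar} => {flour}",
           "{baking powder,sugar} => {flour}",
           "{processed cheese,white bread} => {ham}",
           "{fruit/vegetable juice,ham} => {processed cheese}",
           "{margarine,sugar} => {flour}",
           "{root vegetables,sugar,whole milk} => {flour}"),
  confidence = c(0.3958, 0.3800, 0.4130, 0.6316, 0.3235,
                 0.3125, 0.4634, 0.2895, 0.2963, 0.2941),
  lift = c(35.7158, 22.9282, 21.4936, 18.9957, 18.6077,
           17.9733, 17.8034, 17.4661, 17.0414, 16.9161)
)

# Keep rules that clear both thresholds
strong <- subset(top10_tbl, lift > 5 & confidence > 0.5)
strong
```

Only {Instant food products,soda} => {hamburger meat} clears both bars, which illustrates that high lift and high confidence are separate criteria.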

Clustering

Hierarchical clustering groups items by co-purchase similarity, revealing natural product neighborhoods that can inform store layout and category management.
Items in the same cluster that are not already linked by a strong association rule may be underexploited cross-sell opportunities and are worth investigating.
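
One way to start that investigation is to check, for each rule, whether all of its items fall in a single cluster. The sketch below does this for three of the top-lift rules, using cluster labels copied from the table in Section 5.3. The helper name same_cluster is hypothetical.

```r
# Cluster labels copied from the Section 5.3 table (subset of items)
item_cluster <- c(
  "ham" = 4, "white bread" = 4, "processed cheese" = 4,
  "liquor" = 5, "red/blush wine" = 5, "bottled beer" = 3,
  "sugar" = 1, "flour" = 1, "curd" = 1
)

# Antecedent and consequent items of three rules from Section 3.2
rule_items <- list(
  "{ham,white bread} => {processed cheese}"   = c("ham", "white bread", "processed cheese"),
  "{bottled beer,red/blush wine} => {liquor}" = c("bottled beer", "red/blush wine", "liquor"),
  "{curd,sugar} => {flour}"                   = c("curd", "sugar", "flour")
)

# TRUE when every item of the rule carries the same cluster label
same_cluster <- function(items) length(unique(item_cluster[items])) == 1

within_cluster <- vapply(rule_items, same_cluster, logical(1))
within_cluster
```

The first and third rules stay inside one cluster, while the second spans two: bottled beer (Cluster 3) links to the liquor and red/blush wine pair (Cluster 5).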

Analysis performed in R using arules, arulesViz, and base hierarchical clustering.

Market Basket Analysis – Groceries Dataset

Ariba Mandavia

2026-05-20