1. Introduction

Market Basket Analysis uses association rule mining to discover relationships among items that co-occur in transactions. The classic setting is a grocery store: each receipt is a transaction, and the items on it form an itemset. We mine rules of the form

\[\{A, B\} \Rightarrow \{C\}\]

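and evaluate them with three key metrics. For a generic rule \(X \Rightarrow Y\) (in the example above, \(X = \{A, B\}\) and \(Y = \{C\}\)), with \(P(\cdot)\) denoting the share of transactions that contain an itemset:

| Metric | Formula | Interpretation |
|---|---|---|
| Support | \(P(X \cup Y)\) | How often all items of X and Y appear together in one transaction |
| Confidence | \(P(Y \mid X) = \frac{P(X \cup Y)}{P(X)}\) | How often baskets that contain X also contain Y |
| Lift | \(\frac{P(X \cup Y)}{P(X)\,P(Y)}\) | How much more often X and Y co-occur than they would if independent |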

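A lift above 1 means the items co-occur more often than independence would predict; a lift near 1 means no association, and a lift below 1 means the items tend to be bought apart. When support is very low, a high lift can still be a sampling artifact.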
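As a small illustration of how the three metrics are computed, the sketch below evaluates the rule {bread} ⇒ {butter} on five made-up baskets using base R only. The baskets and the helper name `contains_all` are hypothetical and are not part of the grocery data.

```r
# Five made-up baskets (hypothetical data, not from GroceryDataSet.csv)
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("milk", "butter"),
  c("milk")
)

# TRUE for each basket that contains every item in `items`
contains_all <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

lhs <- "bread"
rhs <- "butter"

p_lhs  <- mean(contains_all(lhs))          # share of baskets with the antecedent
p_rhs  <- mean(contains_all(rhs))          # share of baskets with the consequent
p_both <- mean(contains_all(c(lhs, rhs)))  # share of baskets with both

support    <- p_both
confidence <- p_both / p_lhs
lift       <- p_both / (p_lhs * p_rhs)

round(c(support = support, confidence = confidence, lift = lift), 3)
```

Bread and butter co-occur in 2 of the 5 baskets, so support is 0.4, confidence is 0.4 / 0.6 ≈ 0.667, and lift is 0.4 / (0.6 × 0.6) ≈ 1.111.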


2. Data Loading & Exploration

library(tidyverse)
library(arules)
library(arulesViz)
library(knitr)
library(kableExtra)
# Read as transactions directly — each row is one basket
txns <- read.transactions(
  "GroceryDataSet.csv",
  format = "basket",
  sep    = ",",
  rm.duplicates = TRUE
)

txns
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)
summary(txns)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
itemFrequencyPlot(
  txns,
  topN      = 20,
  type      = "relative",
  col       = colorRampPalette(c("#2c7bb6", "#abd9e9", "#fdae61", "#d7191c"))(20),
  main      = "Top 20 Items by Relative Frequency",
  ylab      = "Item Frequency (proportion of transactions)",
  cex.names = 0.75
)
Top 20 most frequent items across all transactions

Key observations:

  • Whole milk appears in about 25.6 % of all baskets (2513 of 9835), making it by far the most common item.
  • Other vegetables (19.3 %), rolls/buns (18.4 %), soda (17.4 %), and yogurt (14.0 %) round out the top five.
  • The long tail of infrequent items is typical of retail data: only 88 of the 169 items appear in at least 1 % of baskets. This motivates a minimum support threshold to keep the rule search tractable.
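To make the long tail concrete, the sketch below counts how many items clear a few candidate support thresholds. It assumes the `Groceries` data set bundled with arules is the same data as GroceryDataSet.csv (its dimensions, 9835 transactions and 169 items, match the summary above); the thresholds 0.1 and 0.05 are chosen only for illustration.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv
data("Groceries", package = "arules")

# Relative frequency of every item, most frequent first
item_freq <- sort(itemFrequency(Groceries), decreasing = TRUE)

# Number of items that meet each candidate minimum support
thresholds <- c(0.1, 0.05, 0.01, 0.001)
items_kept <- vapply(thresholds, function(s) sum(item_freq >= s), integer(1))

data.frame(min_support = thresholds, items_kept = items_kept)
```

Lowering the threshold admits more items, and the number of candidate itemsets grows much faster than the number of items, which is why the threshold matters for run time.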

3. Association Rule Mining

3.1 Parameter Selection

We use the Apriori algorithm with:

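  • support = 0.001: an itemset must appear in at least 10 of the 9835 transactions (0.001 × 9835 ≈ 9.8, rounded up)
  • confidence = 0.25: at least 25 % of the baskets that contain the antecedent must also contain the consequent
  • minlen = 2: every rule has at least one antecedent item and one consequent item, which excludes rules with an empty left-hand side
  • maxlen = 5: a rule contains at most five items in total, which bounds the search

These thresholds are deliberately permissive: they produce 17,331 rules, so the rules are ranked and filtered by lift afterwards rather than read in full.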
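The relative support threshold translates into an absolute basket count as sketched below in base R. The variable names are illustrative, and 9835 is the transaction count reported in Section 2.

```r
n_txns      <- 9835    # transactions in the data set (from summary(txns))
min_support <- 0.001   # relative support passed to apriori()

# An itemset needs a whole number of baskets, so round up
min_count <- ceiling(min_support * n_txns)
min_count

# The same conversion for a few alternative thresholds
candidates <- c(0.0005, 0.001, 0.005, 0.01)
data.frame(
  support   = candidates,
  min_count = ceiling(candidates * n_txns)
)
```

With 9835 baskets, a support of 0.001 corresponds to 10 baskets, which agrees with the smallest `count` reported by `summary(rules)`.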

rules <- apriori(
  txns,
  parameter = list(
    support    = 0.001,
    confidence = 0.25,
    minlen     = 2,
    maxlen     = 5
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5
##  done [0.01s].
## writing ... [17331 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules
## set of 17331 rules
summary(rules)
## set of 17331 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5 
##  367 6906 8371 1687 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.657   4.000   5.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001017   Min.   :0.2500   Min.   :0.001017   Min.   : 0.9784  
##  1st Qu.:0.001118   1st Qu.:0.3115   1st Qu.:0.002542   1st Qu.: 2.1534  
##  Median :0.001322   Median :0.4000   Median :0.003559   Median : 2.7955  
##  Mean   :0.001917   Mean   :0.4387   Mean   :0.004982   Mean   : 3.0773  
##  3rd Qu.:0.001932   3rd Qu.:0.5385   3rd Qu.:0.005186   3rd Qu.: 3.6563  
##  Max.   :0.074835   Max.   :1.0000   Max.   :0.255516   Max.   :35.7158  
##      count       
##  Min.   : 10.00  
##  1st Qu.: 11.00  
##  Median : 13.00  
##  Mean   : 18.85  
##  3rd Qu.: 19.00  
##  Max.   :736.00  
## 
## mining info:
##  data ntransactions support confidence
##  txns          9835   0.001       0.25
##                                                                                                call
##  apriori(data = txns, parameter = list(support = 0.001, confidence = 0.25, minlen = 2, maxlen = 5))

3.2 Top 10 Rules by Lift

top10 <- sort(rules, by = "lift", decreasing = TRUE)[1:10]

top10_df <- as(top10, "data.frame") %>%
  mutate(
    rules      = as.character(rules),
    support    = round(support,    4),
    confidence = round(confidence, 4),
    lift       = round(lift,       4),
    count      = count
  ) %>%
  select(rules, support, confidence, lift, count)

kable(
  top10_df,
  caption = "Top 10 Association Rules by Lift",
  col.names = c("Rule", "Support", "Confidence", "Lift", "Count"),
  align = "lcccc"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  column_spec(4, bold = TRUE, color = "white",
              background = spec_color(top10_df$lift, option = "D"))
Top 10 Association Rules by Lift

| Rule no. | Rule | Support | Confidence | Lift | Count |
|---|---|---|---|---|---|
| 380 | {bottled beer,red/blush wine} => {liquor} | 0.0019 | 0.3958 | 35.7158 | 19 |
| 1147 | {ham,white bread} => {processed cheese} | 0.0019 | 0.3800 | 22.9282 | 19 |
| 379 | {bottled beer,liquor} => {red/blush wine} | 0.0019 | 0.4130 | 21.4936 | 19 |
| 442 | {Instant food products,soda} => {hamburger meat} | 0.0012 | 0.6316 | 18.9957 | 12 |
| 1585 | {curd,sugar} => {flour} | 0.0011 | 0.3235 | 18.6077 | 11 |
| 1497 | {baking powder,sugar} => {flour} | 0.0010 | 0.3125 | 17.9733 | 10 |
| 1146 | {processed cheese,white bread} => {ham} | 0.0019 | 0.4634 | 17.8034 | 19 |
| 1150 | {fruit/vegetable juice,ham} => {processed cheese} | 0.0011 | 0.2895 | 17.4661 | 11 |
| 1588 | {margarine,sugar} => {flour} | 0.0016 | 0.2963 | 17.0414 | 16 |
| 7495 | {root vegetables,sugar,whole milk} => {flour} | 0.0010 | 0.2941 | 16.9161 | 10 |

Reading the table:

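  • A lift of 10 or more means that baskets containing the antecedent items are at least 10× as likely to contain the consequent as an average basket.
  • High-lift rules tend to have low support: every rule in the table rests on only 10 to 19 baskets, so the lift estimates are noisy. Rare but tightly linked items can still be valuable for targeted promotions, provided the pattern holds up on new data.
  • Confidence is the share of antecedent baskets that also contain the consequent, so it tells the retailer how often each cross-sell recommendation would hit.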
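As a worked check of the first row, lift can be written as confidence divided by the overall frequency of the consequent. The liquor frequency of 0.0111 is taken from the item frequencies reported in Section 5.4; both inputs are rounded, so the result agrees with the tabulated 35.7158 only to about one decimal place.

\[
\text{lift}\big(\{\text{bottled beer},\ \text{red/blush wine}\} \Rightarrow \{\text{liquor}\}\big)
= \frac{\text{confidence}}{P(\text{liquor})}
\approx \frac{0.3958}{0.0111}
\approx 35.7
\]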

4. Rule Visualizations

4.1 Scatter Plot — Support vs. Confidence (color = Lift)

plot(
  rules,
  method  = "scatterplot",
  measure = c("support", "confidence"),
  shading = "lift",
  main    = "Association Rules: Support vs. Confidence (shaded by Lift)"
)
Each point is one rule; higher lift rules appear darker/warmer

4.2 Interactive Graph of Top 30 Rules by Lift

top30 <- sort(rules, by = "lift")[1:30]

plot(
  top30,
  method  = "graph",
  engine  = "htmlwidget"
)

Nodes = items and rules, edges link each rule to its antecedent and consequent items; rule-node size ∝ support, rule-node color ∝ lift

4.3 Grouped Matrix Plot — Top 20 Rules

top20 <- sort(rules, by = "lift")[1:20]

# "main" is not a control parameter of the grouped plot; passing it only makes
# arulesViz print the list of available parameters. Assuming the default
# ggplot2 engine, the title can be added to the returned ggplot object instead.
plot(
  top20,
  method = "grouped"
) +
  ggtitle("Grouped Matrix of Top 20 Rules by Lift")
Antecedent groups (columns) × consequents (rows); size = support, color = lift

4.4 Parallel Coordinates Plot

plot(
  top20,
  method = "paracoord",
  main   = "Parallel Coordinates — Top 20 Rules by Lift"
)
Each line traces one rule from antecedent(s) through to consequent


5. Extra Credit — Cluster Analysis

We cluster the items (not transactions) based on how often they co-occur, using the binary item matrix and hierarchical clustering with Jaccard distance — the natural distance for binary co-occurrence data.

5.1 Build Item Distance Matrix

# Keep only items appearing in ≥ 1 % of transactions for a cleaner cluster map
item_freq  <- itemFrequency(txns)
freq_items <- names(item_freq[item_freq >= 0.01])
txns_sub   <- txns[, freq_items]

# Logical transactions × items matrix; dist() compares rows, so transpose it
# to obtain item × item distances
item_mat  <- as(txns_sub, "matrix")                  # transactions × items
item_dist <- dist(t(item_mat), method = "binary")    # Jaccard distance
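To see why `method = "binary"` yields the Jaccard distance, the sketch below compares `dist()` with a hand computation on a tiny made-up matrix. The item names and basket contents are hypothetical.

```r
# Toy transactions x items matrix (hypothetical data): TRUE = item in basket
toy <- matrix(
  c(TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, FALSE,
    TRUE,  TRUE,  TRUE,
    FALSE, FALSE, TRUE),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("t", 1:4), c("milk", "bread", "wine"))
)

# Jaccard distance by hand: 1 - |baskets with both| / |baskets with either|
jaccard <- function(a, b) 1 - sum(a & b) / sum(a | b)

by_hand <- jaccard(toy[, "milk"], toy[, "bread"])

# dist() works on rows, so transpose to get one row per item
d <- dist(t(toy), method = "binary")
from_dist <- as.matrix(d)["milk", "bread"]

c(by_hand = by_hand, from_dist = from_dist)
```

Milk and bread share 2 of the 3 baskets that contain either item, so both computations give 1 - 2/3 = 1/3. Baskets that contain neither item do not affect the distance, which is what makes it suitable for sparse basket data.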

5.2 Hierarchical Clustering

hc <- hclust(item_dist, method = "ward.D2")

plot(
  hc,
  main   = "Hierarchical Clustering of Grocery Items\n(Jaccard distance, Ward linkage)",
  xlab   = "",
  ylab   = "Height",
  cex    = 0.75,
  col    = "steelblue"
)

# Cut into 5 clusters and outline each one on the dendrogram
rect.hclust(hc, k = 5, border = c("#e41a1c","#377eb8","#4daf4a","#984ea3","#ff7f00"))
Ward-linkage dendrogram of frequent grocery items
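Ward linkage is usually motivated for Euclidean distances, so applying it to Jaccard distances is a heuristic choice. One hedged way to check how faithfully a dendrogram preserves the input distances is the cophenetic correlation. The sketch below compares two linkage methods on randomly generated binary data; the data, the seed, and the object names are hypothetical, and the same comparison could be run on `item_dist`.

```r
set.seed(42)  # arbitrary seed so the toy example is reproducible

# Toy data: 200 baskets x 12 items, each item present with probability 0.3
toy <- matrix(runif(200 * 12) < 0.3, nrow = 200,
              dimnames = list(NULL, paste0("item", 1:12)))

# Jaccard distance between items (dist() works on rows, hence t())
toy_dist <- dist(t(toy), method = "binary")

# Cophenetic correlation: agreement between dendrogram heights and distances
coph_cor <- function(linkage) {
  hc <- hclust(toy_dist, method = linkage)
  cor(as.vector(cophenetic(hc)), as.vector(toy_dist))
}

coph <- sapply(c(ward.D2 = "ward.D2", average = "average"), coph_cor)
coph

# Cluster sizes after cutting the Ward tree into 3 groups
sizes <- table(cutree(hclust(toy_dist, method = "ward.D2"), k = 3))
sizes
```

A higher cophenetic correlation indicates that the tree distorts the original distances less. It is only one criterion and does not by itself justify a particular number of clusters.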

5.3 Cluster Membership

clusters <- cutree(hc, k = 5)

cluster_df <- data.frame(
  Item    = names(clusters),
  Cluster = paste("Cluster", clusters)
) %>%
  arrange(Cluster, Item)

kable(
  cluster_df,
  caption = "Item Cluster Assignments (k = 5)",
  col.names = c("Item", "Cluster")
) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) %>%
  row_spec(which(cluster_df$Cluster == "Cluster 1"), background = "#fde0dc") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 2"), background = "#dceefb") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 3"), background = "#dcfbe5") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 4"), background = "#f3dcfb") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 5"), background = "#fbf3dc")
Item Cluster Assignments (k = 5)

| Item | Cluster |
|---|---|
| UHT-milk | Cluster 1 |
| baking powder | Cluster 1 |
| beverages | Cluster 1 |
| butter milk | Cluster 1 |
| cake bar | Cluster 1 |
| canned fish | Cluster 1 |
| canned vegetables | Cluster 1 |
| cat food | Cluster 1 |
| chewing gum | Cluster 1 |
| cling film/bags | Cluster 1 |
| coffee | Cluster 1 |
| condensed milk | Cluster 1 |
| cream cheese | Cluster 1 |
| curd | Cluster 1 |
| dessert | Cluster 1 |
| detergent | Cluster 1 |
| dish cleaner | Cluster 1 |
| dishes | Cluster 1 |
| flour | Cluster 1 |
| flower (seeds) | Cluster 1 |
| frozen dessert | Cluster 1 |
| frozen fish | Cluster 1 |
| frozen meals | Cluster 1 |
| grapes | Cluster 1 |
| hamburger meat | Cluster 1 |
| hard cheese | Cluster 1 |
| herbs | Cluster 1 |
| hygiene articles | Cluster 1 |
| ice cream | Cluster 1 |
| meat | Cluster 1 |
| misc. beverages | Cluster 1 |
| mustard | Cluster 1 |
| napkins | Cluster 1 |
| oil | Cluster 1 |
| onions | Cluster 1 |
| packaged fruit/vegetables | Cluster 1 |
| pasta | Cluster 1 |
| pickled vegetables | Cluster 1 |
| pot plants | Cluster 1 |
| roll products | Cluster 1 |
| salt | Cluster 1 |
| seasonal products | Cluster 1 |
| semi-finished bread | Cluster 1 |
| sliced cheese | Cluster 1 |
| soft cheese | Cluster 1 |
| spread cheese | Cluster 1 |
| sugar | Cluster 1 |
| white wine | Cluster 1 |
| beef | Cluster 2 |
| berries | Cluster 2 |
| butter | Cluster 2 |
| chicken | Cluster 2 |
| citrus fruit | Cluster 2 |
| domestic eggs | Cluster 2 |
| frozen vegetables | Cluster 2 |
| margarine | Cluster 2 |
| other vegetables | Cluster 2 |
| pip fruit | Cluster 2 |
| pork | Cluster 2 |
| root vegetables | Cluster 2 |
| tropical fruit | Cluster 2 |
| whipped/sour cream | Cluster 2 |
| whole milk | Cluster 2 |
| yogurt | Cluster 2 |
| bottled beer | Cluster 3 |
| bottled water | Cluster 3 |
| brown bread | Cluster 3 |
| canned beer | Cluster 3 |
| frankfurter | Cluster 3 |
| fruit/vegetable juice | Cluster 3 |
| newspapers | Cluster 3 |
| pastry | Cluster 3 |
| rolls/buns | Cluster 3 |
| sausage | Cluster 3 |
| shopping bags | Cluster 3 |
| soda | Cluster 3 |
| candy | Cluster 4 |
| chocolate | Cluster 4 |
| ham | Cluster 4 |
| long life bakery product | Cluster 4 |
| processed cheese | Cluster 4 |
| salty snack | Cluster 4 |
| specialty bar | Cluster 4 |
| specialty chocolate | Cluster 4 |
| waffles | Cluster 4 |
| white bread | Cluster 4 |
| liquor | Cluster 5 |
| red/blush wine | Cluster 5 |

5.4 Cluster Interpretation

# Frequency of each item within each cluster
freq_vec <- itemFrequency(txns_sub)

profile_df <- data.frame(
  Item      = names(freq_vec),
  Frequency = round(freq_vec, 4),
  Cluster   = paste("Cluster", clusters[names(freq_vec)])
) %>%
  group_by(Cluster) %>%
  arrange(desc(Frequency), .by_group = TRUE) %>%
  slice_head(n = 5)

kable(
  profile_df,
  caption = "Top 5 Items per Cluster by Transaction Frequency",
  col.names = c("Item", "Frequency", "Cluster")
) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
Top 5 Items per Cluster by Transaction Frequency

| Item | Frequency | Cluster |
|---|---|---|
| coffee | 0.0581 | Cluster 1 |
| curd | 0.0533 | Cluster 1 |
| napkins | 0.0524 | Cluster 1 |
| cream cheese | 0.0397 | Cluster 1 |
| dessert | 0.0371 | Cluster 1 |
| whole milk | 0.2555 | Cluster 2 |
| other vegetables | 0.1935 | Cluster 2 |
| yogurt | 0.1395 | Cluster 2 |
| root vegetables | 0.1090 | Cluster 2 |
| tropical fruit | 0.1049 | Cluster 2 |
| rolls/buns | 0.1839 | Cluster 3 |
| soda | 0.1744 | Cluster 3 |
| bottled water | 0.1105 | Cluster 3 |
| shopping bags | 0.0985 | Cluster 3 |
| sausage | 0.0940 | Cluster 3 |
| chocolate | 0.0496 | Cluster 4 |
| white bread | 0.0421 | Cluster 4 |
| waffles | 0.0384 | Cluster 4 |
| salty snack | 0.0378 | Cluster 4 |
| long life bakery product | 0.0374 | Cluster 4 |
| red/blush wine | 0.0192 | Cluster 5 |
| liquor | 0.0111 | Cluster 5 |

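Cluster interpretation:
The five clusters are product groupings rather than shopper personas, and they differ sharply in size. Cluster 2 collects fresh food and dairy (whole milk, yogurt, vegetables, fruit, meat, eggs). Cluster 3 combines beverages with bakery and convenience items (soda, bottled water, beer, rolls/buns, newspapers, shopping bags). Cluster 4 holds snacks and sweets together with ham, processed cheese, and white bread. Cluster 5 contains only liquor and red/blush wine, both of which appear in the highest-lift rule in Section 3.2. Cluster 1 is a catch-all for 48 low-frequency items (none above 6 % of baskets) and is too heterogeneous to read as a single category. These groupings can guide store layout, shelf placement, and targeted coupon campaigns.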


6. Summary & Business Insights

Association Rules

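  1. High-lift rules (lift > 5) represent the strongest cross-sell candidates: products that are bought together far more often than chance would predict. Because many of them rest on fewer than 20 baskets, they should be validated before being acted on.
  2. Whole milk, the most frequent item (about 25.6 % of baskets), is a hub item appearing in many consequents; bundling promotions around it will reach a large share of shoppers. Such rules usually carry modest lift, because lift divides by the already high frequency of the consequent.
  3. Rules with higher confidence (> 0.5) are the more dependable candidates for point-of-sale recommendations, since the consequent appears in more than half of the baskets that contain the antecedent.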
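A hedged sketch of how the lift and confidence cut-offs mentioned above could be applied with `subset()` from arules. It re-mines the rules on the `Groceries` data bundled with arules, which is assumed to match GroceryDataSet.csv; the object names `strong` and `strong_nr` are illustrative.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv
data("Groceries", package = "arules")

rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.25, minlen = 2, maxlen = 5),
  control   = list(verbose = FALSE)
)

# Keep rules that clear both cut-offs named in the summary
strong <- subset(rules, lift > 5 & confidence > 0.5)

# Optionally drop rules that a more general rule already covers
strong_nr <- strong[!is.redundant(strong)]

inspect(head(sort(strong_nr, by = "lift"), 5))
```

Adding a minimum `count` to the `subset()` condition is one way to guard against rules that rest on very few baskets.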

Clustering

  1. Hierarchical clustering groups items by co-purchase similarity, revealing natural product neighborhoods that can inform store layout and category management.
  2. Items in the same cluster that are not already linked by a strong association rule are underexploited cross-sell opportunities worth investigating.
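One possible way to shortlist such pairs, sketched in base R on made-up inputs. The cluster assignments, the `rule_pairs` table, and all object names are hypothetical; in practice they would be derived from `clusters` and from the item pairs covered by `rules`.

```r
# Hypothetical cluster assignment (item -> cluster id)
toy_clusters <- c(ham = 1, `white bread` = 1, `processed cheese` = 1,
                  liquor = 2, `red/blush wine` = 2)

# Hypothetical item pairs already linked by a strong rule
rule_pairs <- data.frame(
  item_a = c("ham", "liquor"),
  item_b = c("processed cheese", "red/blush wine"),
  stringsAsFactors = FALSE
)

# Order-independent key so that (a, b) and (b, a) compare equal
pair_key <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = " | ")

# All unordered item pairs that share a cluster
# (clusters with a single item would have to be skipped before combn())
same_cluster <- do.call(rbind, lapply(
  split(names(toy_clusters), toy_clusters),
  function(items) {
    pairs <- t(combn(sort(items), 2))
    data.frame(item_a = pairs[, 1], item_b = pairs[, 2],
               stringsAsFactors = FALSE)
  }
))

# Pairs in the same cluster with no strong rule between them
covered    <- pair_key(rule_pairs$item_a, rule_pairs$item_b)
keys       <- pair_key(same_cluster$item_a, same_cluster$item_b)
candidates <- same_cluster[!keys %in% covered, ]
candidates
```

In this toy input the remaining candidates are the two pairs that involve white bread, since the ham and processed cheese pair and the liquor and wine pair are already covered.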

Analysis performed in R using arules, arulesViz, and base hierarchical clustering.