Market Basket Analysis uses association rule mining to discover relationships among items that co-occur in transactions. The classic setting is a grocery store: each receipt is a transaction, and the items on it form an itemset. We mine rules of the form
\[\{A, B\} \Rightarrow \{C\}\]
and evaluate them with three key metrics:
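| Metric | Formula | Interpretation |
|---|---|---|
| Support | \(P(X \cup Y)\) | How often all items in the rule appear together |
| Confidence | \(P(Y \mid X) = \frac{P(X \cup Y)}{P(X)}\) | How often baskets containing \(X\) also contain \(Y\) |
| Lift | \(\frac{P(X \cup Y)}{P(X)\,P(Y)}\) | How much more often \(X\) and \(Y\) co-occur than expected if they were independent |
Here \(X\) is the antecedent itemset (\(\{A, B\}\) in the rule above), \(Y\) is the consequent (\(\{C\}\)), and \(P(X \cup Y)\) denotes the share of transactions that contain every item in \(X \cup Y\).
A lift above 1 indicates a positive association: the items co-occur more often than independence would predict. A lift near 1 suggests no association, and a lift below 1 a negative one. For rules backed by only a handful of transactions, a high lift can also be a sampling artifact.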
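To make the three metrics concrete, here is a minimal base R sketch that computes them by hand for a single rule. The baskets, the item names, and the helper `support()` are hypothetical and serve only as an illustration; they are not taken from the grocery data.

```r
# Toy baskets (hypothetical data, not from GroceryDataSet.csv)
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk"),
  c("bread", "butter", "jam")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "bread"    # antecedent of the rule
rhs <- "butter"   # consequent of the rule

supp_rule <- support(c(lhs, rhs), baskets)       # both sides together
conf_rule <- supp_rule / support(lhs, baskets)   # P(consequent | antecedent)
lift_rule <- conf_rule / support(rhs, baskets)   # confidence relative to P(consequent)

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

In this toy set, bread and butter appear together in 3 of 5 baskets (support 0.6), 3 of the 4 bread baskets contain butter (confidence 0.75), and butter alone appears in 3 of 5 baskets, which gives a lift of 0.75 / 0.6 = 1.25.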
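# Packages used throughout the analysis
library(arules)      # read.transactions(), apriori(), itemFrequency()
library(arulesViz)   # plot() methods for rules
library(dplyr)       # %>%, mutate(), select(), arrange()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), column_spec(), row_spec()

# Read as transactions directly: each row is one basket
txns <- read.transactions(
  "GroceryDataSet.csv",
  format = "basket",
  sep = ",",
  rm.duplicates = TRUE
)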
txns
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
summary(txns)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
itemFrequencyPlot(
  txns,
  topN = 20,
  type = "relative",
  col = colorRampPalette(c("#2c7bb6", "#abd9e9", "#fdae61", "#d7191c"))(20),
  main = "Top 20 Items by Relative Frequency",
  ylab = "Item Frequency (proportion of transactions)",
  cex.names = 0.75
)
Top 20 most frequent items across all transactions
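Key observations:
- Whole milk is the most frequent item, appearing in 2,513 of the 9,835 baskets (about 26 %), followed by other vegetables (1,903), rolls/buns (1,809), soda (1,715), and yogurt (1,372).
- Baskets are small: the median basket holds 3 items and the mean is 4.4, with 2,159 baskets containing a single item and the largest containing 32.
- The item matrix is sparse, with a density of about 2.6 %.
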
We use the Apriori algorithm with:
- support = 0.001: an itemset must appear in at least ~10 transactions (0.001 × 9,835 ≈ 9.8)
- confidence = 0.25: the rule must be correct at least 25 % of the time
- minlen = 2: rules must have at least one antecedent item
- maxlen = 5: rules contain at most five items in total (antecedent plus consequent)

These thresholds balance rule quantity with meaningfulness.
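The support threshold matters because Apriori uses it to prune the search: an itemset can only be frequent if all of its subsets are frequent. The following is a simplified two-pass sketch of that idea in base R, not the implementation used by `apriori()`. The baskets, the threshold of 0.5, and the helper `support()` are assumptions made up for this illustration.

```r
# Hypothetical toy baskets, for illustration only
baskets <- list(
  c("beer", "chips", "salsa"),
  c("beer", "chips"),
  c("chips", "salsa"),
  c("beer", "diapers"),
  c("beer", "chips", "salsa")
)
min_support <- 0.5   # assumed threshold for this toy example

# Share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Pass 1: keep the single items that meet the support threshold
items <- sort(unique(unlist(baskets)))
item_support <- vapply(items, support, numeric(1))
frequent_items <- items[item_support >= min_support]

# Pass 2: build candidate pairs from frequent items only (the pruning step),
# because a pair cannot be frequent if one of its items is infrequent
candidates <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- vapply(candidates, support, numeric(1))
names(pair_support) <- vapply(candidates, paste, character(1), collapse = ", ")
frequent_pairs <- pair_support[pair_support >= min_support]

print(frequent_items)
print(frequent_pairs)
```

Here "diapers" (1 of 5 baskets) is dropped in the first pass, so no pair containing it is ever counted. Of the three remaining candidate pairs, "beer, salsa" (2 of 5 baskets) falls below the threshold and is discarded in the second pass.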
rules <- apriori(
  txns,
  parameter = list(
    support = 0.001,
    confidence = 0.25,
    minlen = 2,
    maxlen = 5
  )
)
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##       5  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5
## done [0.01s].
## writing ... [17331 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules
## set of 17331 rules
summary(rules)
## set of 17331 rules
##
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5
##  367 6906 8371 1687
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.000   3.000   4.000   3.657   4.000   5.000
##
## summary of quality measures:
##     support           confidence        coverage             lift
##  Min.   :0.001017   Min.   :0.2500   Min.   :0.001017   Min.   : 0.9784
##  1st Qu.:0.001118   1st Qu.:0.3115   1st Qu.:0.002542   1st Qu.: 2.1534
##  Median :0.001322   Median :0.4000   Median :0.003559   Median : 2.7955
##  Mean   :0.001917   Mean   :0.4387   Mean   :0.004982   Mean   : 3.0773
##  3rd Qu.:0.001932   3rd Qu.:0.5385   3rd Qu.:0.005186   3rd Qu.: 3.6563
##  Max.   :0.074835   Max.   :1.0000   Max.   :0.255516   Max.   :35.7158
##      count
##  Min.   : 10.00
##  1st Qu.: 11.00
##  Median : 13.00
##  Mean   : 18.85
##  3rd Qu.: 19.00
##  Max.   :736.00
##
## mining info:
## data ntransactions support confidence
## txns 9835 0.001 0.25
## call
## apriori(data = txns, parameter = list(support = 0.001, confidence = 0.25, minlen = 2, maxlen = 5))
# Ten rules with the highest lift
top10 <- sort(rules, by = "lift", decreasing = TRUE)[1:10]

# Convert to a data frame and round the quality measures for display
top10_df <- as(top10, "data.frame") %>%
  mutate(
    rules      = as.character(rules),
    support    = round(support, 4),
    confidence = round(confidence, 4),
    lift       = round(lift, 4)
  ) %>%
  select(rules, support, confidence, lift, count)

kable(
  top10_df,
  caption   = "Top 10 Association Rules by Lift",
  col.names = c("Rule", "Support", "Confidence", "Lift", "Count"),
  align     = "lcccc"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  column_spec(4, bold = TRUE, color = "white",
              background = spec_color(top10_df$lift, option = "D"))

| | Rule | Support | Confidence | Lift | Count |
|---|---|---|---|---|---|
| 380 | {bottled beer,red/blush wine} => {liquor} | 0.0019 | 0.3958 | 35.7158 | 19 |
| 1147 | {ham,white bread} => {processed cheese} | 0.0019 | 0.3800 | 22.9282 | 19 |
| 379 | {bottled beer,liquor} => {red/blush wine} | 0.0019 | 0.4130 | 21.4936 | 19 |
| 442 | {Instant food products,soda} => {hamburger meat} | 0.0012 | 0.6316 | 18.9957 | 12 |
| 1585 | {curd,sugar} => {flour} | 0.0011 | 0.3235 | 18.6077 | 11 |
| 1497 | {baking powder,sugar} => {flour} | 0.0010 | 0.3125 | 17.9733 | 10 |
| 1146 | {processed cheese,white bread} => {ham} | 0.0019 | 0.4634 | 17.8034 | 19 |
| 1150 | {fruit/vegetable juice,ham} => {processed cheese} | 0.0011 | 0.2895 | 17.4661 | 11 |
| 1588 | {margarine,sugar} => {flour} | 0.0016 | 0.2963 | 17.0414 | 16 |
| 7495 | {root vegetables,sugar,whole milk} => {flour} | 0.0010 | 0.2941 | 16.9161 | 10 |
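Reading the table:
- Each row is one rule; the row label is the rule's index in the full set of 17,331 rules.
- The strongest rule, {bottled beer,red/blush wine} => {liquor}, has a lift of 35.7: baskets containing both bottled beer and red/blush wine include liquor in 39.6 % of cases, roughly 36 times the rate expected if the items were independent.
- Several rules are permutations of one itemset (beer, wine, and liquor; ham, white bread, and processed cheese), and four of the ten predict flour from baskets that contain sugar.
- Support is low throughout (10 to 19 transactions per rule), so these lifts are best treated as leads to verify rather than firm effects.
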
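As a sanity check on how the columns relate, lift can be recomputed as confidence divided by the overall frequency of the consequent, and support as count divided by the number of transactions. The sketch below does this for the first row of the table. The liquor frequency of 0.0111 is taken from the cluster profile table later in this report, and all inputs are rounded to four decimals, so the result should only be expected to match the reported lift approximately.

```r
# Values copied from the tables in this report (rounded to 4 decimals)
conf_rule   <- 0.3958   # confidence of {bottled beer,red/blush wine} => {liquor}
supp_liquor <- 0.0111   # overall frequency of liquor
n_txns      <- 9835     # number of transactions
count_rule  <- 19       # transactions containing all three items

lift_check <- conf_rule / supp_liquor   # lift = confidence / P(consequent)
supp_check <- count_rule / n_txns       # support = count / number of transactions

print(round(c(lift = lift_check, support = supp_check), 4))
```

The recomputed lift lands close to the reported 35.7158; the small gap comes from rounding the inputs rather than from a different formula.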
plot(
  rules,
  method  = "scatterplot",
  measure = c("support", "confidence"),
  shading = "lift",
  main    = "Association Rules: Support vs. Confidence (shaded by Lift)"
)
Each point is one rule; higher lift rules appear darker/warmer
Nodes = items, edges = rules; node size ∝ support, edge color ∝ lift
top20 <- sort(rules, by = "lift")[1:20]
plot(
  top20,
  method = "grouped",
  main   = "Grouped Matrix of Top 20 Rules by Lift"
)
## Available control parameters (with default values):
## k = 20
## aggr.fun = function (x, ...) UseMethod("mean")
## rhs_max = 10
## lhs_label_items = 2
## col = c("#EE0000FF", "#EEEEEEFF")
## groups = NULL
## engine = ggplot2
## verbose = FALSE
Antecedents (rows) × consequents (columns); size = support, color = lift
We cluster the items (not transactions) based on how often they co-occur, using the binary item matrix and hierarchical clustering with Jaccard distance — the natural distance for binary co-occurrence data.
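To show what the Jaccard distance measures, here is a small base R sketch on a made-up indicator matrix. The matrix `toy` and the helper `jaccard_dist()` are hypothetical. `dist(..., method = "binary")` is the base R call that treats non-zero entries as "on" and returns the share of positions where exactly one of the two vectors is on, among the positions where at least one is on.

```r
# Hypothetical 5-basket x 3-item indicator matrix (1 = item is in the basket)
toy <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 1, 1,
    1, 0, 0,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("milk", "yogurt", "beer"))
)

# Jaccard distance by hand: 1 - (baskets with both) / (baskets with either)
jaccard_dist <- function(x, y) 1 - sum(x & y) / sum(x | y)

by_hand  <- jaccard_dist(toy[, "milk"], toy[, "yogurt"])

# dist() compares rows, so transpose to get one row per item
built_in <- as.matrix(dist(t(toy), method = "binary"))["milk", "yogurt"]

print(c(by_hand = by_hand, built_in = built_in))
```

Milk and yogurt share 2 baskets out of the 4 that contain either one, so both calculations give a distance of 0.5. Items that are never bought together get a distance of 1.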
# Keep only items appearing in ≥ 1 % of transactions for a cleaner cluster map
item_freq  <- itemFrequency(txns)
freq_items <- names(item_freq[item_freq >= 0.01])
txns_sub   <- txns[, freq_items]

# Transactions × items binary matrix; transpose so that dist() compares items
item_mat  <- as(txns_sub, "matrix")                 # transactions × items
item_dist <- dist(t(item_mat), method = "binary")   # Jaccard distance between items

# Ward linkage on the item distances
hc <- hclust(item_dist, method = "ward.D2")
plot(
  hc,
  main = "Hierarchical Clustering of Grocery Items\n(Jaccard distance, Ward linkage)",
  xlab = "",
  ylab = "Height",
  cex = 0.75,
  col = "steelblue"
)
# Outline the 5 clusters on the dendrogram with colored rectangles
rect.hclust(hc, k = 5, border = c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00"))
Ward-linkage dendrogram of frequent grocery items
# Assign each item to one of 5 clusters
clusters <- cutree(hc, k = 5)
cluster_df <- data.frame(
  Item = names(clusters),
  Cluster = paste("Cluster", clusters)
) %>%
  arrange(Cluster, Item)

kable(
  cluster_df,
  caption = "Item Cluster Assignments (k = 5)",
  col.names = c("Item", "Cluster")
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(which(cluster_df$Cluster == "Cluster 1"), background = "#fde0dc") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 2"), background = "#dceefb") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 3"), background = "#dcfbe5") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 4"), background = "#f3dcfb") %>%
  row_spec(which(cluster_df$Cluster == "Cluster 5"), background = "#fbf3dc")

| Item | Cluster |
|---|---|
| UHT-milk | Cluster 1 |
| baking powder | Cluster 1 |
| beverages | Cluster 1 |
| butter milk | Cluster 1 |
| cake bar | Cluster 1 |
| canned fish | Cluster 1 |
| canned vegetables | Cluster 1 |
| cat food | Cluster 1 |
| chewing gum | Cluster 1 |
| cling film/bags | Cluster 1 |
| coffee | Cluster 1 |
| condensed milk | Cluster 1 |
| cream cheese | Cluster 1 |
| curd | Cluster 1 |
| dessert | Cluster 1 |
| detergent | Cluster 1 |
| dish cleaner | Cluster 1 |
| dishes | Cluster 1 |
| flour | Cluster 1 |
| flower (seeds) | Cluster 1 |
| frozen dessert | Cluster 1 |
| frozen fish | Cluster 1 |
| frozen meals | Cluster 1 |
| grapes | Cluster 1 |
| hamburger meat | Cluster 1 |
| hard cheese | Cluster 1 |
| herbs | Cluster 1 |
| hygiene articles | Cluster 1 |
| ice cream | Cluster 1 |
| meat | Cluster 1 |
| misc. beverages | Cluster 1 |
| mustard | Cluster 1 |
| napkins | Cluster 1 |
| oil | Cluster 1 |
| onions | Cluster 1 |
| packaged fruit/vegetables | Cluster 1 |
| pasta | Cluster 1 |
| pickled vegetables | Cluster 1 |
| pot plants | Cluster 1 |
| roll products | Cluster 1 |
| salt | Cluster 1 |
| seasonal products | Cluster 1 |
| semi-finished bread | Cluster 1 |
| sliced cheese | Cluster 1 |
| soft cheese | Cluster 1 |
| spread cheese | Cluster 1 |
| sugar | Cluster 1 |
| white wine | Cluster 1 |
| beef | Cluster 2 |
| berries | Cluster 2 |
| butter | Cluster 2 |
| chicken | Cluster 2 |
| citrus fruit | Cluster 2 |
| domestic eggs | Cluster 2 |
| frozen vegetables | Cluster 2 |
| margarine | Cluster 2 |
| other vegetables | Cluster 2 |
| pip fruit | Cluster 2 |
| pork | Cluster 2 |
| root vegetables | Cluster 2 |
| tropical fruit | Cluster 2 |
| whipped/sour cream | Cluster 2 |
| whole milk | Cluster 2 |
| yogurt | Cluster 2 |
| bottled beer | Cluster 3 |
| bottled water | Cluster 3 |
| brown bread | Cluster 3 |
| canned beer | Cluster 3 |
| frankfurter | Cluster 3 |
| fruit/vegetable juice | Cluster 3 |
| newspapers | Cluster 3 |
| pastry | Cluster 3 |
| rolls/buns | Cluster 3 |
| sausage | Cluster 3 |
| shopping bags | Cluster 3 |
| soda | Cluster 3 |
| candy | Cluster 4 |
| chocolate | Cluster 4 |
| ham | Cluster 4 |
| long life bakery product | Cluster 4 |
| processed cheese | Cluster 4 |
| salty snack | Cluster 4 |
| specialty bar | Cluster 4 |
| specialty chocolate | Cluster 4 |
| waffles | Cluster 4 |
| white bread | Cluster 4 |
| liquor | Cluster 5 |
| red/blush wine | Cluster 5 |
# Overall transaction frequency of each item; keep the 5 most frequent per cluster
freq_vec <- itemFrequency(txns_sub)
profile_df <- data.frame(
  Item = names(freq_vec),
  Frequency = round(freq_vec, 4),
  Cluster = paste("Cluster", clusters[names(freq_vec)])
) %>%
  group_by(Cluster) %>%
  arrange(desc(Frequency), .by_group = TRUE) %>%
  slice_head(n = 5)

kable(
  profile_df,
  caption = "Top 5 Items per Cluster by Transaction Frequency",
  col.names = c("Item", "Frequency", "Cluster")
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

| Item | Frequency | Cluster |
|---|---|---|
| coffee | 0.0581 | Cluster 1 |
| curd | 0.0533 | Cluster 1 |
| napkins | 0.0524 | Cluster 1 |
| cream cheese | 0.0397 | Cluster 1 |
| dessert | 0.0371 | Cluster 1 |
| whole milk | 0.2555 | Cluster 2 |
| other vegetables | 0.1935 | Cluster 2 |
| yogurt | 0.1395 | Cluster 2 |
| root vegetables | 0.1090 | Cluster 2 |
| tropical fruit | 0.1049 | Cluster 2 |
| rolls/buns | 0.1839 | Cluster 3 |
| soda | 0.1744 | Cluster 3 |
| bottled water | 0.1105 | Cluster 3 |
| shopping bags | 0.0985 | Cluster 3 |
| sausage | 0.0940 | Cluster 3 |
| chocolate | 0.0496 | Cluster 4 |
| white bread | 0.0421 | Cluster 4 |
| waffles | 0.0384 | Cluster 4 |
| salty snack | 0.0378 | Cluster 4 |
| long life bakery product | 0.0374 | Cluster 4 |
| red/blush wine | 0.0192 | Cluster 5 |
| liquor | 0.0111 | Cluster 5 |
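Cluster interpretation:
The dendrogram and cluster profiles group the 88 frequent items into product categories that tend to be bought together. Because the clustering is over items rather than transactions, the groups describe products, not individual shoppers. Cluster 2 gathers fresh produce, dairy, and meat (whole milk, other vegetables, yogurt, root vegetables). Cluster 3 combines beverages with grab-and-go items (soda, rolls/buns, bottled water, sausage). Cluster 4 is dominated by snacks and sweets, plus the ham, white bread, and processed cheese trio seen in the top rules. Cluster 5 contains only liquor and red/blush wine, and Cluster 1 is a large catch-all of 48 lower-frequency items. These groupings can guide store layout, shelf placement, and targeted coupon campaigns.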
Analysis performed in R using arules, arulesViz, and base hierarchical clustering.