Packages

library(ggplot2)
library(dplyr)
library(tidyr)
library(reshape2)

The Data

Overview

raw <- read.csv("GroceryDataSet.csv", header = FALSE,
                stringsAsFactors = FALSE, fill = TRUE)

# Each row is one transaction; blank cells are structural padding, not data
transactions <- lapply(seq_len(nrow(raw)), function(i) {
  row   <- as.character(raw[i, ])
  items <- trimws(row[row != "" & !is.na(row)])
  items[nchar(items) > 0]
})

n_trans   <- length(transactions)
all_items <- sort(unique(unlist(transactions)))
n_items   <- length(all_items)

basket_sizes <- sapply(transactions, length)

knitr::kable(
  data.frame(
    Metric = c("Transactions", "Unique items",
               "Min basket", "Median basket", "Mean basket", "Max basket"),
    Value  = c(n_trans, n_items,
               min(basket_sizes), median(basket_sizes),
               round(mean(basket_sizes), 2), max(basket_sizes))
  ),
  caption   = "Dataset overview",
  row.names = FALSE
)

Dataset overview
Metric	Value
Transactions	9835.00
Unique items	169.00
Min basket	1.00
Median basket	3.00
Mean basket	4.41
Max basket	32.00

The dataset contains 9835 grocery receipts — one transaction per row — across 169 unique products. The average basket holds 4.4 items, but the distribution is right-skewed: most shoppers buy a handful of staples, while a small number of transactions contain many items. This sparsity is a defining characteristic of market basket data and directly motivates the support threshold used in the Apriori algorithm below.

basket_df <- data.frame(size = basket_sizes)

ggplot(basket_df, aes(x = size)) +
  geom_histogram(binwidth = 1, fill = "#2166ac", colour = "white", alpha = 0.85) +
  geom_vline(xintercept = mean(basket_sizes), linetype = "dashed",
             colour = "#d73027", linewidth = 0.9) +
  annotate("text", x = mean(basket_sizes) + 0.7,
           y = 1400, label = paste("mean =", round(mean(basket_sizes), 1)),
           colour = "#d73027", size = 3.8, hjust = 0) +
  labs(
    title    = "Distribution of Basket Sizes",
    subtitle = "Each bin represents one additional item in the transaction",
    x = "Items per Transaction", y = "Count"
  ) +
  theme_minimal(base_size = 13)

Item Frequencies

# Build binary incidence matrix: rows = transactions, cols = items
# crossprod() on this matrix computes all pairwise co-occurrence counts
# in a single optimised BLAS call — critical for performance at this scale
mat <- matrix(FALSE, nrow = n_trans, ncol = n_items)
colnames(mat) <- all_items
for (i in seq_along(transactions)) {
  mat[i, match(transactions[[i]], all_items)] <- TRUE
}

item_sup       <- colSums(mat) / n_trans   # support of every singleton
top20_sup      <- sort(item_sup, decreasing = TRUE)[1:20]
whole_milk_pct <- round(item_sup["whole milk"] * 100, 1)

knitr::kable(
  data.frame(Item = names(top20_sup),
             Support = round(top20_sup, 4),
             Transactions = as.integer(top20_sup * n_trans),
             row.names = NULL),
  caption = "Top 20 items by support (proportion of transactions containing each item)"
)

Top 20 items by support (proportion of transactions containing each item)
Item	Support	Transactions
whole milk	0.2555	2513
other vegetables	0.1935	1903
rolls/buns	0.1839	1809
soda	0.1744	1715
yogurt	0.1395	1372
bottled water	0.1105	1087
root vegetables	0.1090	1072
tropical fruit	0.1049	1032
shopping bags	0.0985	969
sausage	0.0940	924
pastry	0.0890	875
citrus fruit	0.0828	814
bottled beer	0.0805	792
newspapers	0.0798	785
canned beer	0.0777	764
pip fruit	0.0756	744
fruit/vegetable juice	0.0723	711
whipped/sour cream	0.0717	705
brown bread	0.0649	638
domestic eggs	0.0634	624

top20_df <- data.frame(
  item    = names(top20_sup),
  support = as.numeric(top20_sup)
) %>% mutate(item = reorder(item, support))

ggplot(top20_df, aes(x = item, y = support)) +
  geom_col(fill = "#2166ac", alpha = 0.88, width = 0.7) +
  geom_text(aes(label = sprintf("%.3f", support)),
            hjust = -0.1, size = 3.2, colour = "#333333") +
  coord_flip() +
  geom_hline(yintercept = 0.01, linetype = "dashed",
             colour = "#d73027", linewidth = 0.8) +
  annotate("text", y = 0.013, x = 1.5,
           label = "min support = 0.01", colour = "#d73027",
           size = 3.2, hjust = 0) +
  scale_y_continuous(limits = c(0, max(top20_sup) * 1.20)) +
  labs(
    title    = "Top 20 Items by Support",
    subtitle = "Whole milk appears in more than 1 in 4 baskets",
    x = NULL, y = "Support"
  ) +
  theme_minimal(base_size = 13)

Whole milk is by far the most common item, appearing in 25.6% of all transactions. Other vegetables, rolls/buns, and soda round out the top four. The dominance of staple dairy and bakery products is a consistent signal that will recur throughout the association rule analysis.

Association Rules — Apriori Algorithm

Algorithm Design

The Apriori algorithm is implemented in base R without external packages. Transactions are encoded as a binary matrix and all pairwise co-occurrence counts are computed in a single crossprod() call, which makes the search fast even at nearly 10,000 transactions.

Minimum support is set to 0.01 (~98 transactions) and minimum confidence to 0.10. The search extends to 3-itemset rules — two items in the antecedent, one in the consequent — beyond which support falls below the threshold for nearly all combinations in this dataset.

Frequent 2-Itemsets

Frequent 3-Itemsets and Rule Generation

all_rules <- rbind(rules2, rules3)
all_rules <- all_rules[order(-all_rules$lift), ]
rownames(all_rules) <- NULL

knitr::kable(
  data.frame(
    Metric     = c("Total rules", "Support (min)", "Support (median)",
                   "Support (max)", "Confidence (min)", "Confidence (median)",
                   "Confidence (max)", "Lift (min)", "Lift (median)", "Lift (max)"),
    Value      = round(c(nrow(all_rules),
                   min(all_rules$support),    median(all_rules$support),    max(all_rules$support),
                   min(all_rules$confidence), median(all_rules$confidence), max(all_rules$confidence),
                   min(all_rules$lift),       median(all_rules$lift),       max(all_rules$lift)), 4)
  ),
  caption   = "Summary of all generated association rules",
  row.names = FALSE
)

Summary of all generated association rules
Metric	Value
Total rules	427.0000
Support (min)	0.0101
Support (median)	0.0143
Support (max)	0.0748
Confidence (min)	0.1007
Confidence (median)	0.2193
Confidence (max)	0.5862
Lift (min)	0.7899
Lift (median)	1.6171
Lift (max)	3.3723

Top 10 Rules by Lift

top10 <- head(all_rules, 10)
top10_display <- top10 %>%
  mutate(
    support    = round(support,    4),
    confidence = round(confidence, 4),
    lift       = round(lift,       4)
  )

knitr::kable(
  top10_display,
  caption  = "Top 10 association rules ranked by lift (min support = 0.01, min confidence = 0.10)",
  col.names = c("Antecedent", "Consequent", "Support", "Confidence", "Lift"),
  row.names = FALSE
)

Top 10 association rules ranked by lift (min support = 0.01, min confidence = 0.10)
Antecedent	Consequent	Support	Confidence	Lift
whole milk + yogurt	curd	0.0101	0.1797	3.3723
citrus fruit + other vegetables	root vegetables	0.0104	0.3592	3.2950
other vegetables + yogurt	whipped/sour cream	0.0102	0.2342	3.2671
other vegetables + tropical fruit	root vegetables	0.0123	0.3428	3.1448
beef	root vegetables	0.0174	0.3314	3.0404
root vegetables	beef	0.0174	0.1595	3.0404
citrus fruit + root vegetables	other vegetables	0.0104	0.5862	3.0296
root vegetables + tropical fruit	other vegetables	0.0123	0.5845	3.0210
other vegetables + whole milk	root vegetables	0.0232	0.3098	2.8421
other vegetables + whole milk	butter	0.0115	0.1535	2.7706

Interpreting the Three Metrics

Support measures how common a rule is across all transactions. Rule 9 (other vegetables + whole milk → root vegetables) has the highest support in the top 10 at 0.0232, appearing in approximately 228 baskets — meaning it is both statistically reliable and commercially significant in volume.

Confidence measures the conditional probability of the consequent given the antecedent. Rule 7 (citrus fruit + root vegetables → other vegetables) has a confidence of 0.5862: nearly 59% of shoppers who bought citrus fruit and root vegetables also bought other vegetables. This is the most actionable single-rule probability in the top 10.

Lift is the primary sorting metric and the most informative of the three. A lift of 1.0 indicates statistical independence — the items co-occur no more than chance would predict. A lift greater than 1.0 indicates positive association. The top rule (whole milk + yogurt → curd) achieves a lift of 3.3723, meaning shoppers buying both whole milk and yogurt are 3.37× more likely to also buy curd than a randomly selected shopper — a very strong signal from a dairy-aisle cluster.

All ten rules in the table have lift well above 1.0, confirming that these are genuine associations and not statistical accidents. The median lift across all 427 rules is 1.62, indicating that most rules in the full set also represent meaningful positive associations.

What the Rules Tell Us

The top 10 rules reveal a coherent story: this store has a strong fresh produce and dairy cluster. Root vegetables, other vegetables, citrus fruit, tropical fruit, yogurt, whole milk, butter, and curd all cross-associate with each other. The beef ↔︎ root vegetables pairing (rules 5 and 6, both with lift 3.04) is a distinct cooking cluster — shoppers buying ingredients for a pot roast or stew. These patterns have direct implications for shelf layout, promotional bundling, and cross-category recommendations.

Support vs. Confidence — Rule Landscape

lift_range <- range(all_rules$lift)
pal        <- colorRampPalette(c("#3182bd", "#fdae6b", "#e6550d"))(100)
lift_idx   <- round(
  (all_rules$lift - lift_range[1]) / diff(lift_range) * 99
) + 1

# ggplot version with continuous colour scale
ggplot(all_rules, aes(x = support, y = confidence, colour = lift)) +
  geom_point(size = 1.8, alpha = 0.7) +
  scale_colour_gradient2(
    low      = "#3182bd",
    mid      = "#fdae6b",
    high     = "#e6550d",
    midpoint = median(all_rules$lift),
    name     = "Lift"
  ) +
  labs(
    title    = "Association Rules: Support vs. Confidence",
    subtitle = paste0("Colour encodes lift | ", nrow(all_rules),
                      " total rules | min support = ", MIN_SUP,
                      ", min confidence = ", MIN_CONF),
    x = "Support", y = "Confidence"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "right")

The scatter plot reveals an important structural feature of this rule set: support and confidence are weakly inversely related. High-confidence rules tend to have low support because they involve specific item combinations that are relatively rare even though they are internally consistent (e.g., a shopper who buys citrus fruit AND root vegetables almost always also buys other vegetables, but that precise triplet occurs in only ~1% of all baskets). The highest-lift rules — shown in orange and red — sit in the low-support, moderate-to-high confidence region, exactly where we expect the most surprising and actionable associations to live.

Cluster Analysis — Shopper Archetypes

Design

The goal of the cluster analysis is to group transactions into shopper archetypes based on their purchase patterns. This is complementary to the association rules: rules identify item-level relationships; clusters identify shopper-level segments.

The feature matrix uses the top 30 items by support as binary columns. Using all 169 items would introduce excessive sparsity (most transactions contain fewer than 5 items), which destabilises k-means centroids. The top 30 cover the backbone of what most shoppers buy and give the clustering algorithm a meaningful signal to work with.

top30 <- names(sort(item_sup, decreasing = TRUE))[1:30]
cmat  <- fmat[, top30]

knitr::kable(
  data.frame(Item = top30, Support = round(item_sup[top30], 4), row.names = NULL),
  caption   = "Top 30 items used as clustering features",
  row.names = FALSE
)

Top 30 items used as clustering features
Item	Support
whole milk	0.2555
other vegetables	0.1935
rolls/buns	0.1839
soda	0.1744
yogurt	0.1395
bottled water	0.1105
root vegetables	0.1090
tropical fruit	0.1049
shopping bags	0.0985
sausage	0.0940
pastry	0.0890
citrus fruit	0.0828
bottled beer	0.0805
newspapers	0.0798
canned beer	0.0777
pip fruit	0.0756
fruit/vegetable juice	0.0723
whipped/sour cream	0.0717
brown bread	0.0649
domestic eggs	0.0634
frankfurter	0.0590
margarine	0.0586
coffee	0.0581
pork	0.0577
butter	0.0554
curd	0.0533
beef	0.0525
napkins	0.0524
chocolate	0.0496
frozen vegetables	0.0481

Choosing k — Elbow Method

set.seed(42)
wss <- sapply(2:8, function(k) {
  kmeans(cmat, centers = k, nstart = 15, iter.max = 100)$tot.withinss
})

elbow_df <- data.frame(k = 2:8, wss = wss)

ggplot(elbow_df, aes(x = k, y = wss)) +
  geom_line(colour = "#1a4f72", linewidth = 1.2) +
  geom_point(colour = "#1a4f72", size = 3) +
  geom_point(data = elbow_df[elbow_df$k == 4, ],
             colour = "#d73027", size = 5) +
  annotate("text", x = 4.15, y = elbow_df$wss[elbow_df$k == 4] + 100,
           label = "k = 4 (chosen)", colour = "#d73027",
           size = 3.8, hjust = 0) +
  labs(
    title    = "Elbow Method — Total Within-Cluster Sum of Squares",
    subtitle = "The rate of improvement slows noticeably after k = 4",
    x = "Number of Clusters (k)", y = "Total Within-Cluster SS"
  ) +
  theme_minimal(base_size = 13)

The elbow plot shows a clear inflection at k = 4: adding a fifth cluster reduces within-cluster SS by a substantially smaller margin than moving from k = 3 to k = 4. Four clusters offer a good balance between compactness and parsimony.

Fitting the Final Model

set.seed(42)
km4    <- kmeans(cmat, centers = 4, nstart = 25, iter.max = 200)

sizes <- as.data.frame(table(Cluster = km4$cluster))
knitr::kable(sizes,
             caption = "Transaction counts per cluster (k = 4)",
             row.names = FALSE)

Transaction counts per cluster (k = 4)
Cluster	Freq
1	1257
2	5013
3	1776
4	1789

pca     <- prcomp(cmat, center = TRUE, scale. = FALSE)
pct_var <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)
scores  <- as.data.frame(pca$x[, 1:2])
scores$cluster <- factor(km4$cluster)

# Sample 2,000 transactions so individual points are visible
set.seed(1)
samp <- sample(nrow(scores), 2000)

ggplot(scores[samp, ], aes(x = PC1, y = PC2, colour = cluster)) +
  geom_point(size = 0.9, alpha = 0.55) +
  scale_colour_manual(
    values = c("1" = "#e41a1c", "2" = "#377eb8",
               "3" = "#4daf4a", "4" = "#ff7f00"),
    name   = "Cluster"
  ) +
  labs(
    title    = "k-Means Clusters (k = 4) — PCA Projection",
    subtitle = paste0(
      "PC1 explains ", pct_var[1], "% of variance; PC2 explains ",
      pct_var[2], "% | 2,000-transaction sample"
    ),
    x = paste0("PC1 (", pct_var[1], "%)"),
    y = paste0("PC2 (", pct_var[2], "%)")
  ) +
  theme_minimal(base_size = 13)

Cluster Profiles

# Average purchase rate per item within each cluster
profiles    <- aggregate(cmat,
                         by   = list(Cluster = km4$cluster),
                         FUN  = mean)
profile_mat <- as.matrix(profiles[, -1])
rownames(profile_mat) <- paste0("Cluster ", 1:4)

# Top 5 characteristic items per cluster
profile_tbl <- do.call(rbind, lapply(1:4, function(cl) {
  row   <- profile_mat[cl, ]
  top5  <- names(sort(row, decreasing = TRUE))[1:5]
  n_cl  <- sum(km4$cluster == cl)
  data.frame(
    Cluster = cl,
    Size    = n_cl,
    Top5_Items = paste(top5, collapse = ", "),
    stringsAsFactors = FALSE
  )
}))

knitr::kable(profile_tbl,
             caption  = "Cluster sizes and top 5 characteristic items",
             col.names = c("Cluster", "Transactions", "Top 5 Items"),
             row.names = FALSE)

Cluster sizes and top 5 characteristic items
Cluster	Transactions	Top 5 Items
1	1257	soda, rolls/buns, bottled water, shopping bags, sausage
2	5013	rolls/buns, canned beer, yogurt, bottled water, shopping bags
3	1776	whole milk, rolls/buns, yogurt, soda, root vegetables
4	1789	other vegetables, whole milk, root vegetables, yogurt, rolls/buns

# Scale each column to [0,1] so items with naturally high support
# (e.g., whole milk) do not visually dominate less-common items
scaled_mat <- apply(profile_mat, 2,
                    function(x) (x - min(x)) / (max(x) - min(x) + 1e-9))

heat_df <- melt(scaled_mat)
colnames(heat_df) <- c("Cluster", "Item", "Value")

ggplot(heat_df, aes(x = Item, y = Cluster, fill = Value)) +
  geom_tile(colour = "white", linewidth = 0.4) +
  scale_fill_gradient(low = "white", high = "#e6550d", name = "Scaled\nRate") +
  scale_x_discrete(guide = guide_axis(angle = 55)) +
  labs(
    title    = "Cluster Profiles — Relative Item Purchase Rate",
    subtitle = "Scaled 0–1 within each item column; darker = higher relative purchase rate",
    x = NULL, y = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(axis.text.y = element_text(size = 10, face = "bold"))

Interpreting the Four Clusters

The four clusters reveal distinct shopper archetypes, each with a coherent behavioural profile.

Cluster 1 — Convenience / Snack Buyers (n = 1257). Characterised by soda, rolls/buns, bottled water, shopping bags, and sausage. These shoppers are making quick, convenience-oriented purchases — cold drinks, bread, and easy-protein items. They show below-average purchase rates for fresh produce and dairy, suggesting they are not making weekly family shops. This group is a candidate for cross-promotions with snack foods and ready-to-eat meals.

Cluster 2 — Everyday Staples (n = 5013, the largest group). The dominant cluster, representing the broad majority of typical shopping trips. Rolls/buns, canned beer, yogurt, and bottled water are characteristic. This cluster has moderate purchase rates across almost all items — it is the “average” shopper archetype. Its size confirms that most transactions in this dataset are routine, undifferentiated weekly shops rather than targeted specialty trips.

Cluster 3 — Family Basket (n = 1776). Defined by whole milk, rolls/buns, yogurt, soda, and root vegetables. The combination of dairy staples and root vegetables suggests family-oriented shopping with a mix of both functional staples (milk, yogurt) and cooking ingredients (root vegetables). The presence of soda alongside healthier items is consistent with shopping for a household with both adults and children.

Cluster 4 — Fresh and Healthy (n = 1789). The most distinctive archetype: other vegetables, whole milk, root vegetables, yogurt, and rolls/buns. This cluster mirrors the dominant signal from the association rules — the fresh produce and dairy shoppers. These are the buyers most likely to respond to cross-promotions in the vegetable, fruit, and dairy aisles, and they represent the association rules’ target audience.

The PCA projection confirms that the clusters are real but overlapping — there is no clean separation in two dimensions. This is expected: most shoppers share some common staple purchases, and the differences between archetypes are differences in degree (how frequently each item type appears) rather than kind (all-or-nothing membership). The heatmap captures this gradient more clearly than the scatter plot, showing that whole milk and rolls/buns have elevated purchase rates across nearly all clusters while fresh vegetables and specialty items are the key differentiators.

sessionInfo()

R version 4.5.0 (2025-04-11) Platform: aarch64-apple-darwin20 Running under: macOS 26.4.1

Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York tzcode source: internal

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] reshape2_1.4.4 tidyr_1.3.1 dplyr_1.1.4 ggplot2_4.0.0.9000

loaded via a namespace (and not attached): [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.0 tidyselect_1.2.1
[5] Rcpp_1.1.0 stringr_1.5.2 jquerylib_0.1.4 scales_1.4.0
[9] yaml_2.3.10 fastmap_1.2.0 R6_2.6.1 plyr_1.8.9
[13] labeling_0.4.3 generics_0.1.4 knitr_1.50 tibble_3.3.0
[17] bslib_0.9.0 pillar_1.11.1 RColorBrewer_1.1-3 rlang_1.1.6
[21] cachem_1.1.0 stringi_1.8.7 xfun_0.53 sass_0.4.10
[25] S7_0.2.0 cli_3.6.5 withr_3.0.2 magrittr_2.0.4
[29] digest_0.6.37 grid_4.5.0 rstudioapi_0.17.1 lifecycle_1.0.4
[33] vctrs_0.6.5 evaluate_1.0.5 glue_1.8.0 farver_2.1.2
[37] rmarkdown_2.30 purrr_1.1.0 tools_4.5.0 pkgconfig_2.0.3
[41] htmltools_0.5.8.1

Market Basket Analysis

Association Rules and Cluster Analysis — Groceries Dataset

Kerwin Charles

2026-05-03