raw <- read.csv("GroceryDataSet.csv", header = FALSE,
stringsAsFactors = FALSE, fill = TRUE)
# Each row is one transaction; blank cells are structural padding, not data
transactions <- lapply(seq_len(nrow(raw)), function(i) {
row <- as.character(raw[i, ])
items <- trimws(row[row != "" & !is.na(row)])
items[nchar(items) > 0]
})
n_trans <- length(transactions)
all_items <- sort(unique(unlist(transactions)))
n_items <- length(all_items)
basket_sizes <- sapply(transactions, length)
knitr::kable(
data.frame(
Metric = c("Transactions", "Unique items",
"Min basket", "Median basket", "Mean basket", "Max basket"),
Value = c(n_trans, n_items,
min(basket_sizes), median(basket_sizes),
round(mean(basket_sizes), 2), max(basket_sizes))
),
caption = "Dataset overview",
row.names = FALSE
)| Metric | Value |
|---|---|
| Transactions | 9835.00 |
| Unique items | 169.00 |
| Min basket | 1.00 |
| Median basket | 3.00 |
| Mean basket | 4.41 |
| Max basket | 32.00 |
The dataset contains 9835 grocery receipts — one transaction per row — across 169 unique products. The average basket holds 4.4 items, but the distribution is right-skewed: most shoppers buy a handful of staples, while a small number of transactions contain many items. This sparsity is a defining characteristic of market basket data and directly motivates the support threshold used in the Apriori algorithm below.
basket_df <- data.frame(size = basket_sizes)
ggplot(basket_df, aes(x = size)) +
geom_histogram(binwidth = 1, fill = "#2166ac", colour = "white", alpha = 0.85) +
geom_vline(xintercept = mean(basket_sizes), linetype = "dashed",
colour = "#d73027", linewidth = 0.9) +
annotate("text", x = mean(basket_sizes) + 0.7,
y = 1400, label = paste("mean =", round(mean(basket_sizes), 1)),
colour = "#d73027", size = 3.8, hjust = 0) +
labs(
title = "Distribution of Basket Sizes",
subtitle = "Each bin represents one additional item in the transaction",
x = "Items per Transaction", y = "Count"
) +
theme_minimal(base_size = 13)# Build binary incidence matrix: rows = transactions, cols = items
# crossprod() on this matrix computes all pairwise co-occurrence counts
# in a single optimised BLAS call — critical for performance at this scale
mat <- matrix(FALSE, nrow = n_trans, ncol = n_items)
colnames(mat) <- all_items
for (i in seq_along(transactions)) {
mat[i, match(transactions[[i]], all_items)] <- TRUE
}
item_sup <- colSums(mat) / n_trans # support of every singleton
top20_sup <- sort(item_sup, decreasing = TRUE)[1:20]
whole_milk_pct <- round(item_sup["whole milk"] * 100, 1)
knitr::kable(
data.frame(Item = names(top20_sup),
Support = round(top20_sup, 4),
Transactions = as.integer(top20_sup * n_trans),
row.names = NULL),
caption = "Top 20 items by support (proportion of transactions containing each item)"
)| Item | Support | Transactions |
|---|---|---|
| whole milk | 0.2555 | 2513 |
| other vegetables | 0.1935 | 1903 |
| rolls/buns | 0.1839 | 1809 |
| soda | 0.1744 | 1715 |
| yogurt | 0.1395 | 1372 |
| bottled water | 0.1105 | 1087 |
| root vegetables | 0.1090 | 1072 |
| tropical fruit | 0.1049 | 1032 |
| shopping bags | 0.0985 | 969 |
| sausage | 0.0940 | 924 |
| pastry | 0.0890 | 875 |
| citrus fruit | 0.0828 | 814 |
| bottled beer | 0.0805 | 792 |
| newspapers | 0.0798 | 785 |
| canned beer | 0.0777 | 764 |
| pip fruit | 0.0756 | 744 |
| fruit/vegetable juice | 0.0723 | 711 |
| whipped/sour cream | 0.0717 | 705 |
| brown bread | 0.0649 | 638 |
| domestic eggs | 0.0634 | 624 |
top20_df <- data.frame(
item = names(top20_sup),
support = as.numeric(top20_sup)
) %>% mutate(item = reorder(item, support))
ggplot(top20_df, aes(x = item, y = support)) +
geom_col(fill = "#2166ac", alpha = 0.88, width = 0.7) +
geom_text(aes(label = sprintf("%.3f", support)),
hjust = -0.1, size = 3.2, colour = "#333333") +
coord_flip() +
geom_hline(yintercept = 0.01, linetype = "dashed",
colour = "#d73027", linewidth = 0.8) +
annotate("text", y = 0.013, x = 1.5,
label = "min support = 0.01", colour = "#d73027",
size = 3.2, hjust = 0) +
scale_y_continuous(limits = c(0, max(top20_sup) * 1.20)) +
labs(
title = "Top 20 Items by Support",
subtitle = "Whole milk appears in more than 1 in 4 baskets",
x = NULL, y = "Support"
) +
theme_minimal(base_size = 13)Whole milk is by far the most common item, appearing in 25.6% of all transactions. Other vegetables, rolls/buns, and soda round out the top four. The dominance of staple dairy and bakery products is a consistent signal that will recur throughout the association rule analysis.
The Apriori algorithm is implemented in base R without external
packages. Transactions are encoded as a binary matrix and all pairwise
co-occurrence counts are computed in a single crossprod()
call, which makes the search fast even at nearly 10,000
transactions.
Minimum support is set to 0.01 (~98 transactions) and minimum confidence to 0.10. The search extends to 3-itemset rules — two items in the antecedent, one in the consequent — beyond which support falls below the threshold for nearly all combinations in this dataset.
all_rules <- rbind(rules2, rules3)
all_rules <- all_rules[order(-all_rules$lift), ]
rownames(all_rules) <- NULL
knitr::kable(
data.frame(
Metric = c("Total rules", "Support (min)", "Support (median)",
"Support (max)", "Confidence (min)", "Confidence (median)",
"Confidence (max)", "Lift (min)", "Lift (median)", "Lift (max)"),
Value = round(c(nrow(all_rules),
min(all_rules$support), median(all_rules$support), max(all_rules$support),
min(all_rules$confidence), median(all_rules$confidence), max(all_rules$confidence),
min(all_rules$lift), median(all_rules$lift), max(all_rules$lift)), 4)
),
caption = "Summary of all generated association rules",
row.names = FALSE
)| Metric | Value |
|---|---|
| Total rules | 427.0000 |
| Support (min) | 0.0101 |
| Support (median) | 0.0143 |
| Support (max) | 0.0748 |
| Confidence (min) | 0.1007 |
| Confidence (median) | 0.2193 |
| Confidence (max) | 0.5862 |
| Lift (min) | 0.7899 |
| Lift (median) | 1.6171 |
| Lift (max) | 3.3723 |
top10 <- head(all_rules, 10)
top10_display <- top10 %>%
mutate(
support = round(support, 4),
confidence = round(confidence, 4),
lift = round(lift, 4)
)
knitr::kable(
top10_display,
caption = "Top 10 association rules ranked by lift (min support = 0.01, min confidence = 0.10)",
col.names = c("Antecedent", "Consequent", "Support", "Confidence", "Lift"),
row.names = FALSE
)| Antecedent | Consequent | Support | Confidence | Lift |
|---|---|---|---|---|
| whole milk + yogurt | curd | 0.0101 | 0.1797 | 3.3723 |
| citrus fruit + other vegetables | root vegetables | 0.0104 | 0.3592 | 3.2950 |
| other vegetables + yogurt | whipped/sour cream | 0.0102 | 0.2342 | 3.2671 |
| other vegetables + tropical fruit | root vegetables | 0.0123 | 0.3428 | 3.1448 |
| beef | root vegetables | 0.0174 | 0.3314 | 3.0404 |
| root vegetables | beef | 0.0174 | 0.1595 | 3.0404 |
| citrus fruit + root vegetables | other vegetables | 0.0104 | 0.5862 | 3.0296 |
| root vegetables + tropical fruit | other vegetables | 0.0123 | 0.5845 | 3.0210 |
| other vegetables + whole milk | root vegetables | 0.0232 | 0.3098 | 2.8421 |
| other vegetables + whole milk | butter | 0.0115 | 0.1535 | 2.7706 |
Support measures how common a rule is across all transactions. Rule 9 (other vegetables + whole milk → root vegetables) has the highest support in the top 10 at 0.0232, appearing in approximately 228 baskets — meaning it is both statistically reliable and commercially significant in volume.
Confidence measures the conditional probability of the consequent given the antecedent. Rule 7 (citrus fruit + root vegetables → other vegetables) has a confidence of 0.5862: nearly 59% of shoppers who bought citrus fruit and root vegetables also bought other vegetables. This is the most actionable single-rule probability in the top 10.
Lift is the primary sorting metric and the most informative of the three. A lift of 1.0 indicates statistical independence — the items co-occur no more than chance would predict. A lift greater than 1.0 indicates positive association. The top rule (whole milk + yogurt → curd) achieves a lift of 3.3723, meaning shoppers buying both whole milk and yogurt are 3.37× more likely to also buy curd than a randomly selected shopper — a very strong signal from a dairy-aisle cluster.
All ten rules in the table have lift well above 1.0, confirming that these are genuine associations and not statistical accidents. The median lift across all 427 rules is 1.62, indicating that most rules in the full set also represent meaningful positive associations.
The top 10 rules reveal a coherent story: this store has a strong fresh produce and dairy cluster. Root vegetables, other vegetables, citrus fruit, tropical fruit, yogurt, whole milk, butter, and curd all cross-associate with each other. The beef ↔︎ root vegetables pairing (rules 5 and 6, both with lift 3.04) is a distinct cooking cluster — shoppers buying ingredients for a pot roast or stew. These patterns have direct implications for shelf layout, promotional bundling, and cross-category recommendations.
lift_range <- range(all_rules$lift)
pal <- colorRampPalette(c("#3182bd", "#fdae6b", "#e6550d"))(100)
lift_idx <- round(
(all_rules$lift - lift_range[1]) / diff(lift_range) * 99
) + 1
# ggplot version with continuous colour scale
ggplot(all_rules, aes(x = support, y = confidence, colour = lift)) +
geom_point(size = 1.8, alpha = 0.7) +
scale_colour_gradient2(
low = "#3182bd",
mid = "#fdae6b",
high = "#e6550d",
midpoint = median(all_rules$lift),
name = "Lift"
) +
labs(
title = "Association Rules: Support vs. Confidence",
subtitle = paste0("Colour encodes lift | ", nrow(all_rules),
" total rules | min support = ", MIN_SUP,
", min confidence = ", MIN_CONF),
x = "Support", y = "Confidence"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "right")The scatter plot reveals an important structural feature of this rule set: support and confidence are weakly inversely related. High-confidence rules tend to have low support because they involve specific item combinations that are relatively rare even though they are internally consistent (e.g., a shopper who buys citrus fruit AND root vegetables almost always also buys other vegetables, but that precise triplet occurs in only ~1% of all baskets). The highest-lift rules — shown in orange and red — sit in the low-support, moderate-to-high confidence region, exactly where we expect the most surprising and actionable associations to live.
The goal of the cluster analysis is to group transactions into shopper archetypes based on their purchase patterns. This is complementary to the association rules: rules identify item-level relationships; clusters identify shopper-level segments.
The feature matrix uses the top 30 items by support as binary columns. Using all 169 items would introduce excessive sparsity (most transactions contain fewer than 5 items), which destabilises k-means centroids. The top 30 cover the backbone of what most shoppers buy and give the clustering algorithm a meaningful signal to work with.
top30 <- names(sort(item_sup, decreasing = TRUE))[1:30]
cmat <- fmat[, top30]
knitr::kable(
data.frame(Item = top30, Support = round(item_sup[top30], 4), row.names = NULL),
caption = "Top 30 items used as clustering features",
row.names = FALSE
)| Item | Support |
|---|---|
| whole milk | 0.2555 |
| other vegetables | 0.1935 |
| rolls/buns | 0.1839 |
| soda | 0.1744 |
| yogurt | 0.1395 |
| bottled water | 0.1105 |
| root vegetables | 0.1090 |
| tropical fruit | 0.1049 |
| shopping bags | 0.0985 |
| sausage | 0.0940 |
| pastry | 0.0890 |
| citrus fruit | 0.0828 |
| bottled beer | 0.0805 |
| newspapers | 0.0798 |
| canned beer | 0.0777 |
| pip fruit | 0.0756 |
| fruit/vegetable juice | 0.0723 |
| whipped/sour cream | 0.0717 |
| brown bread | 0.0649 |
| domestic eggs | 0.0634 |
| frankfurter | 0.0590 |
| margarine | 0.0586 |
| coffee | 0.0581 |
| pork | 0.0577 |
| butter | 0.0554 |
| curd | 0.0533 |
| beef | 0.0525 |
| napkins | 0.0524 |
| chocolate | 0.0496 |
| frozen vegetables | 0.0481 |
set.seed(42)
wss <- sapply(2:8, function(k) {
kmeans(cmat, centers = k, nstart = 15, iter.max = 100)$tot.withinss
})
elbow_df <- data.frame(k = 2:8, wss = wss)
ggplot(elbow_df, aes(x = k, y = wss)) +
geom_line(colour = "#1a4f72", linewidth = 1.2) +
geom_point(colour = "#1a4f72", size = 3) +
geom_point(data = elbow_df[elbow_df$k == 4, ],
colour = "#d73027", size = 5) +
annotate("text", x = 4.15, y = elbow_df$wss[elbow_df$k == 4] + 100,
label = "k = 4 (chosen)", colour = "#d73027",
size = 3.8, hjust = 0) +
labs(
title = "Elbow Method — Total Within-Cluster Sum of Squares",
subtitle = "The rate of improvement slows noticeably after k = 4",
x = "Number of Clusters (k)", y = "Total Within-Cluster SS"
) +
theme_minimal(base_size = 13)The elbow plot shows a clear inflection at k = 4: adding a fifth cluster reduces within-cluster SS by a substantially smaller margin than moving from k = 3 to k = 4. Four clusters offer a good balance between compactness and parsimony.
set.seed(42)
km4 <- kmeans(cmat, centers = 4, nstart = 25, iter.max = 200)
sizes <- as.data.frame(table(Cluster = km4$cluster))
knitr::kable(sizes,
caption = "Transaction counts per cluster (k = 4)",
row.names = FALSE)| Cluster | Freq |
|---|---|
| 1 | 1257 |
| 2 | 5013 |
| 3 | 1776 |
| 4 | 1789 |
pca <- prcomp(cmat, center = TRUE, scale. = FALSE)
pct_var <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)
scores <- as.data.frame(pca$x[, 1:2])
scores$cluster <- factor(km4$cluster)
# Sample 2,000 transactions so individual points are visible
set.seed(1)
samp <- sample(nrow(scores), 2000)
ggplot(scores[samp, ], aes(x = PC1, y = PC2, colour = cluster)) +
geom_point(size = 0.9, alpha = 0.55) +
scale_colour_manual(
values = c("1" = "#e41a1c", "2" = "#377eb8",
"3" = "#4daf4a", "4" = "#ff7f00"),
name = "Cluster"
) +
labs(
title = "k-Means Clusters (k = 4) — PCA Projection",
subtitle = paste0(
"PC1 explains ", pct_var[1], "% of variance; PC2 explains ",
pct_var[2], "% | 2,000-transaction sample"
),
x = paste0("PC1 (", pct_var[1], "%)"),
y = paste0("PC2 (", pct_var[2], "%)")
) +
theme_minimal(base_size = 13)# Average purchase rate per item within each cluster
profiles <- aggregate(cmat,
by = list(Cluster = km4$cluster),
FUN = mean)
profile_mat <- as.matrix(profiles[, -1])
rownames(profile_mat) <- paste0("Cluster ", 1:4)
# Top 5 characteristic items per cluster
profile_tbl <- do.call(rbind, lapply(1:4, function(cl) {
row <- profile_mat[cl, ]
top5 <- names(sort(row, decreasing = TRUE))[1:5]
n_cl <- sum(km4$cluster == cl)
data.frame(
Cluster = cl,
Size = n_cl,
Top5_Items = paste(top5, collapse = ", "),
stringsAsFactors = FALSE
)
}))
knitr::kable(profile_tbl,
caption = "Cluster sizes and top 5 characteristic items",
col.names = c("Cluster", "Transactions", "Top 5 Items"),
row.names = FALSE)| Cluster | Transactions | Top 5 Items |
|---|---|---|
| 1 | 1257 | soda, rolls/buns, bottled water, shopping bags, sausage |
| 2 | 5013 | rolls/buns, canned beer, yogurt, bottled water, shopping bags |
| 3 | 1776 | whole milk, rolls/buns, yogurt, soda, root vegetables |
| 4 | 1789 | other vegetables, whole milk, root vegetables, yogurt, rolls/buns |
# Scale each column to [0,1] so items with naturally high support
# (e.g., whole milk) do not visually dominate less-common items
scaled_mat <- apply(profile_mat, 2,
function(x) (x - min(x)) / (max(x) - min(x) + 1e-9))
heat_df <- melt(scaled_mat)
colnames(heat_df) <- c("Cluster", "Item", "Value")
ggplot(heat_df, aes(x = Item, y = Cluster, fill = Value)) +
geom_tile(colour = "white", linewidth = 0.4) +
scale_fill_gradient(low = "white", high = "#e6550d", name = "Scaled\nRate") +
scale_x_discrete(guide = guide_axis(angle = 55)) +
labs(
title = "Cluster Profiles — Relative Item Purchase Rate",
subtitle = "Scaled 0–1 within each item column; darker = higher relative purchase rate",
x = NULL, y = NULL
) +
theme_minimal(base_size = 11) +
theme(axis.text.y = element_text(size = 10, face = "bold"))The four clusters reveal distinct shopper archetypes, each with a coherent behavioural profile.
Cluster 1 — Convenience / Snack Buyers (n = 1257). Characterised by soda, rolls/buns, bottled water, shopping bags, and sausage. These shoppers are making quick, convenience-oriented purchases — cold drinks, bread, and easy-protein items. They show below-average purchase rates for fresh produce and dairy, suggesting they are not making weekly family shops. This group is a candidate for cross-promotions with snack foods and ready-to-eat meals.
Cluster 2 — Everyday Staples (n = 5013, the largest group). The dominant cluster, representing the broad majority of typical shopping trips. Rolls/buns, canned beer, yogurt, and bottled water are characteristic. This cluster has moderate purchase rates across almost all items — it is the “average” shopper archetype. Its size confirms that most transactions in this dataset are routine, undifferentiated weekly shops rather than targeted specialty trips.
Cluster 3 — Family Basket (n = 1776). Defined by whole milk, rolls/buns, yogurt, soda, and root vegetables. The combination of dairy staples and root vegetables suggests family-oriented shopping with a mix of both functional staples (milk, yogurt) and cooking ingredients (root vegetables). The presence of soda alongside healthier items is consistent with shopping for a household with both adults and children.
Cluster 4 — Fresh and Healthy (n = 1789). The most distinctive archetype: other vegetables, whole milk, root vegetables, yogurt, and rolls/buns. This cluster mirrors the dominant signal from the association rules — the fresh produce and dairy shoppers. These are the buyers most likely to respond to cross-promotions in the vegetable, fruit, and dairy aisles, and they represent the association rules’ target audience.
The PCA projection confirms that the clusters are real but overlapping — there is no clean separation in two dimensions. This is expected: most shoppers share some common staple purchases, and the differences between archetypes are differences in degree (how frequently each item type appears) rather than kind (all-or-nothing membership). The heatmap captures this gradient more clearly than the scatter plot, showing that whole milk and rolls/buns have elevated purchase rates across nearly all clusters while fresh vegetables and specialty items are the key differentiators.
R version 4.5.0 (2025-04-11) Platform: aarch64-apple-darwin20 Running under: macOS 26.4.1
Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York tzcode source: internal
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] reshape2_1.4.4 tidyr_1.3.1 dplyr_1.1.4 ggplot2_4.0.0.9000
loaded via a namespace (and not attached): [1] gtable_0.3.6
jsonlite_2.0.0 compiler_4.5.0 tidyselect_1.2.1
[5] Rcpp_1.1.0 stringr_1.5.2 jquerylib_0.1.4 scales_1.4.0
[9] yaml_2.3.10 fastmap_1.2.0 R6_2.6.1 plyr_1.8.9
[13] labeling_0.4.3 generics_0.1.4 knitr_1.50 tibble_3.3.0
[17] bslib_0.9.0 pillar_1.11.1 RColorBrewer_1.1-3 rlang_1.1.6
[21] cachem_1.1.0 stringi_1.8.7 xfun_0.53 sass_0.4.10
[25] S7_0.2.0 cli_3.6.5 withr_3.0.2 magrittr_2.0.4
[29] digest_0.6.37 grid_4.5.0 rstudioapi_0.17.1 lifecycle_1.0.4
[33] vctrs_0.6.5 evaluate_1.0.5 glue_1.8.0 farver_2.1.2
[37] rmarkdown_2.30 purrr_1.1.0 tools_4.5.0 pkgconfig_2.0.3
[41] htmltools_0.5.8.1