Market Basket Analysis Based on Association Rules

Introduction

In this project, I use the transaction data from the Instacart Dataset. The project covers the full workflow from data loading, sampling, association rule mining, to visualization and interpretation. To handle performance bottlenecks in large-scale transaction data, I adopt data processing and memory optimization strategies: dataset loading with data.table::fread, order-based sampling (20%–50% of orders), explicit garbage collection (gc()), and Apriori with maxlen to avoid rule explosion. The Apriori algorithm is the primary method used in this project for frequent itemset mining. I also introduce the FP-Growth algorithm (via the high-performance fim4r backend) and compare its execution time with Apriori to demonstrate scalability advantages on large transaction data. I also add supplementary methods, such as Kulczynski and Certainty Factor, for robustness measures. Results show that the “organic agricultural products” category exhibits clear co-occurrence patterns, suitable for platform cross-recommendations and combination marketing.

1. Background and Data Overview

Instacart is a leading online grocery platform in North America. The original dataset contains millions of historical orders over 30,000 products. Those high-dimensional, sparse transaction records make traditional frequency statistics insufficient to capture synergistic relationships among products. Therefore, I employ sampling and market basket analysis to mine “co-purchase behaviors.”

Data Source: Instacart Market Basket Dataset
Sampling Strategy: Use sampling (e.g. 20%-50% of all orders) to balance computational cost and statistical representativeness.

2. Method Description

2.1 The Association Evaluation Metrics

In the context of the Instacart dataset, the primary challenge is: how to distinguish meaningful purchasing patterns from random co-occurrences. Simple frequency counts are insufficient due to the scale and sparsity of the transaction matrix. To address this, I employ a three phases evaluation framework:

Support (\(P(A \cap B)\)): Filtering Noise.

Support measures the joint probability of itemset \(A\) and \(B\) co-occurring in a transaction. Its primary role is noise filtering—by setting a min_support threshold, I eliminate rare itemsets that lack statistical significance, ensuring that subsequent analysis focuses on patterns with sufficient data coverage.

\[Support(A \to B) = P(A \cap B) = \frac{\mathrm{count}(A \cap B)}{N}\]

Confidence (\(P(B|A)\)): Measuring Reliability.

Confidence assesses the conditional probability: given that a customer buys \(A\), how likely are they to buy \(B\)? It quantifies the strength of the implication. However, Confidence alone can be misleading if the consequent \(B\) is extremely popular (e.g., Bananas), leading to high confidence purely by chance.

\[Confidence(A \to B) = P(B|A) = \frac{Support(A \cap B)}{Support(A)}\]

Lift: Correcting Popularity Bias.

To solve the “popularity bias” inherent in Confidence, I introduce Lift. It compares the observed support of the rule with the expected support if \(A\) and \(B\) were independent. A Lift > 1 indicates a positive correlation, meaning the occurrence of \(A\) actually boosts the probability of \(B\), rather than \(B\) just being popular on its own.

\[Lift(A \to B) = \frac{P(B|A)}{P(B)} = \frac{Confidence(A \to B)}{Support(B)}\]

2.2 The Apriori Algorithm

The search space for frequent itemsets grows exponentially with the number of products (\(2^d\)), creating a “combinatorial explosion” problem that is computationally intractable for 30,000+ products.

To handle this computational cost, I adopt the Apriori Algorithm. It utilizes the downward closure property (or monotonicity principle): if an itemset is infrequent, all its supersets must also be infrequent.

Mechanism: The algorithm employs a level-wise search strategy. It generates candidate itemsets of length \(k\) only from frequent itemsets of length \(k-1\).

Advantage: This strategy drastically prunes the search space by eliminating candidates early, making the mining process feasible for large-scale datasets like Instacart.

2.3 Advanced Interest Measures

Standard metrics (Support/Confidence) are not “null-invariant” and can be skewed by the imbalance of the data (e.g., highly frequent vs. rare items). To ensure the robustness of our findings, I hereby introduce two advanced metrics:

Kulczynski Measure:

The arithmetic mean of conditional probabilities. Unlike Confidence, Kulczynski is symmetric and treats the antecedent and consequent equally. It is particularly robust in imbalanced datasets, ensuring that relationships are not dominated by the majority class.

\[Kulc(A, B) = \frac{1}{2}(P(B|A) + P(A|B))\]

Certainty Factor (CF):

Originating from expert systems, CF measures the net confidence gain. It answers the question: How much does knowing A increase our belief in B, relative to our prior belief?

\[CF(A \to B) = \frac{Confidence(A \to B) - Support(B)}{1 - Support(B)}\]

CF measure ranges from [-1, 1]. A positive CF indicates that purchasing \(A\) genuinely increases the likelihood of purchasing \(B\). This measure effectively isolating the “pure” association strength from the background popularity of \(B\).

3. Experimental Design and Implementation

3.1 Environment Configuration and Package Management

To keep the analysis reproducible and stable on the dataset, I record the R version, load the exact packages used in later steps, and keep the data path explicit. This section documents the computing environment, ensures that the required libraries are available before processing begins, and prevents silent failures caused by missing dependencies.

If you run this notebook on another machine, update data_dir and re-run this cell.

# R version : 4.3.1
# Set data directory (adjust to your environment)
data_dir <- "/Users/moondudu/moon/jiedan/20260204/提交版本-Apriori-260204/dataset/Instacart_Market_Basket_Analysis/archive"

# Set CRAN mirror (required when running non-interactively, e.g. knitting)
if (is.null(getOption("repos")) || getOption("repos") == "@CRAN@") {
  options(repos = c(CRAN = "https://cloud.r-project.org"))
}

# Install / load packages
if (!require("arules")) install.packages("arules")
if (!require("arulesViz")) install.packages("arulesViz")
if (!require("data.table")) install.packages("data.table")
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("circlize")) install.packages("circlize")
if (!require("RColorBrewer")) install.packages("RColorBrewer")
if (!require("fim4r")) install.packages("fim4r")
if (!require("kableExtra")) install.packages("kableExtra")
if (!require("formattable")) install.packages("formattable")

library(arules)
library(arulesViz)
library(data.table)
library(tidyverse)
library(circlize)
library(RColorBrewer)
library(fim4r)
library(kableExtra)
library(formattable)
cat("arules version:", as.character(packageVersion("arules")), "\n")
## arules version: 1.7.13
library(ggplot2)
library(tidyr)
library(arulesSequences)

3.2 Data Loading and Preprocessing

3.2.1 Data Loading

I use data.table::fread for fast, memory-efficient reads and load only order_id and product_id to keep the working set compact. The products table is used to map IDs to names.

products <- fread(file.path(data_dir, "products.csv"))
order_items <- fread(file.path(data_dir, "order_products__prior.csv"), select = c("order_id", "product_id"))
cat("Total rows in original data:", nrow(order_items), "\n")
## Total rows in original data: 32434489

3.2.2 Large-Scale Order-Based Sampling

I sample a fraction of all orders (e.g. 30%; adjust to 20%-50% depending on memory) with set.seed(2026) for reproducibility. Sampling is done by unique order_id so each basket remains intact. I retain full-data Top 10 for representativeness checks, then release the full table and run gc() to free memory.

# Large-scale order-based sampling (e.g. 30% of orders; use 20%-50% depending on memory)
set.seed(2026)
all_orders <- unique(order_items$order_id)
sample_frac <- 0.3   # adjust: 0.2--0.5 for large-scale analysis
sample_ids <- sample(all_orders, size = length(all_orders) * sample_frac)

# Sampling representativeness: compute full Top 10 before releasing full data (for comparison with sample)
full_top10 <- order_items[, .N, by = product_id][order(-N)][1:10]
setnames(full_top10, "N", "full_count")

# Filter data and merge product names
data_sub <- order_items[order_id %in% sample_ids]
data_sub <- merge(data_sub, products[, .(product_id, product_name)], by = "product_id")

# Release full data and intermediate objects; force garbage collection
rm(order_items, all_orders)
gc()
##            used  (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells  3339334 178.4    5332521  284.8         NA   5332521  284.8
## Vcells 26151267 199.6  153674407 1172.5      36864 190183848 1451.0
cat("Data ready. Sample contains", length(sample_ids), "baskets (", round(sample_frac * 100, 0), "% of orders).\n")
## Data ready. Sample contains 964462 baskets ( 30 % of orders).

3.3 Transaction Conversion

Association-rule mining in arules expects a transactions object, so I transform each order into a basket of product names and then cast the list into the sparse transaction format. The split by order_id approach is memory-efficient because it avoids constructing large dense matrices, and it keeps each basket intact for valid co-occurrence counting. After this step, the analysis are operated on the transaction object rather than the raw order table.

Summary Statistics of trans:

  • Total baskets (transactions): 964,462
  • Total unique items: 48,452
  • Matrix density: 0.02% (extremely sparse, typical for retail basket data)
  • Average basket size: 10.1 items (median = 8, range: 1–137)
  • Most frequent items: Banana (142,206), Bag of Organic Bananas (113,752), Organic Strawberries (79,758), Organic Baby Spinach (72,143), Organic Hass Avocado (64,200)

The distribution of basket sizes is right-skewed: 50% of baskets contain 8 or fewer items, while only 25% exceed 14 items. The dominance of organic produce among top items suggests a health-conscious customer segment, which will likely surface in the association rules.

# 4. Convert to transactions object
trans <- as(split(data_sub$product_name, data_sub$order_id), "transactions")
summary(trans)
## transactions as itemMatrix in sparse format with
##  964462 rows (elements/itemsets/transactions) and
##  48452 columns (items) and a density of 0.0002084263 
## 
## most frequent items:
##                 Banana Bag of Organic Bananas   Organic Strawberries 
##                 142206                 113752                  79758 
##   Organic Baby Spinach   Organic Hass Avocado                (Other) 
##                  72143                  64200                9267727 
## 
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 47229 56164 61916 66434 68545 68200 65806 61059 55263 49513 44014 39538 35033 
##    14    15    16    17    18    19    20    21    22    23    24    25    26 
## 31073 27526 24484 21247 18820 16546 14538 12474 11068  9468  8211  7196  6162 
##    27    28    29    30    31    32    33    34    35    36    37    38    39 
##  5261  4591  3881  3431  2937  2453  2086  1799  1528  1329  1133   955   828 
##    40    41    42    43    44    45    46    47    48    49    50    51    52 
##   676   606   462   421   385   307   273   212   176   170   141   123    97 
##    53    54    55    56    57    58    59    60    61    62    63    64    65 
##    83    78    68    56    61    43    44    27    28    27    19    12    14 
##    66    67    68    69    70    71    72    73    74    75    76    77    78 
##     7    13     5     4     7     9     8     4     7     3     4     2     2 
##    79    80    81    82    83    84    85    86    87    88    90    91    92 
##     2     5     2     3     1     4     1     2     2     1     1     1     4 
##    93    96    98    99   100   102   108   121   127   137 
##     1     1     1     1     1     1     1     1     1     1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     5.0     8.0    10.1    14.0   137.0 
## 
## includes extended item information - examples:
##                                          labels
## 1                        .5\\"" Waterproof Tape
## 2                      'Swingtop' Premium Lager
## 3 (70% Juice!) Mountain Raspberry Juice Squeeze
## 
## includes extended transaction information - examples:
##   transactionID
## 1             5
## 2            20
## 3            22

3.4 Data Analysis

Here I start with item frequency to establish a baseline of the most popular products and to identify potential “hub” items that appear in many baskets. This step is important because high-frequency items can dominate association rules if not interpreted carefully, and they often serve as anchor products in recommendation strategies. The Top 20 frequency plot provides a quick sanity check on whether the sample aligns with intuitive shopping behavior.

The frequency analysis reveals a distinct dominance of fresh produce over other categories. “Banana” and “Bag of Organic Bananas” emerge as the absolute market leaders, exhibiting a significantly higher frequency than other items. Notably, the top tier is heavily populated by “Organic” labeled products (e.g., Organic Strawberries, Organic Baby Spinach), indicating a user base with strong health-conscious preferences. This distribution confirms that the dataset primarily captures weekly fresh grocery restocking behavior, suggesting these high-frequency staples will likely act as strong “hubs” in our following association rules.

# Ensure RColorBrewer is loaded
if (!require("RColorBrewer")) install.packages("RColorBrewer")
library(RColorBrewer)

# Plot top 20 item frequency
itemFrequencyPlot(trans, topN = 20, col = brewer.pal(8, "Pastel2"), main = "Instacart Top 20 Items")

3.4.1 Order Time Heatmap

I visualize order counts by hour of day and day of week to detect stable temporal shopping patterns. A large, order-based sample reduces random noise, allowing clear identification of peak windows (e.g., evening or weekend spikes) that can support timing-based promotions or staffing decisions. The heatmap also helps validate whether the sampling reflects typical consumer routines rather than anomalies.

The heatmap generated reveals a distinct “Weekend Daytime” dominance. The strongest peaks cluster heavily on Day 0 and Day 1 (likely Saturday and Sunday) from 10:00 to 16:00, representing the prime grocery shopping window where server load and delivery demand are highest. Weekdays (Days 2-6) show a flatter activity curve with moderate engagement. Conversely, deep “valleys” are consistently observed between 00:00 and 06:00 across all days, marking the ideal downtime for system maintenance, data backups, or restocking without impacting user experience.

# Load order-level time info (if available)
orders_path <- file.path(data_dir, "orders.csv")
if (file.exists(orders_path)) {
  orders <- fread(orders_path, select = c("order_id", "order_dow", "order_hour_of_day"))
  orders_sample <- orders[order_id %in% sample_ids]
  rm(orders)
  gc()
  order_time_agg <- orders_sample[, .N, by = .(order_dow, order_hour_of_day)]
  ggplot(order_time_agg, aes(x = order_hour_of_day, y = order_dow, fill = N)) +
    geom_tile() +
    scale_fill_gradient(low = "white", high = "darkblue") +
    labs(title = "Order time heatmap (sample): order_hour_of_day vs order_dow",
         x = "Hour of day", y = "Day of week", fill = "Orders") +
    theme_minimal()
} else {
  cat("orders.csv not found; skipping order time heatmap.\n")
}

3.4.2 Long-tail Analysis (Rank-Frequency)

The rank-frequency (log-log) plot checks whether product demand is highly skewed, which is expected in retail baskets. A heavy tail implies that a small fraction of products contributes a large share of volume, while many niche items have low but meaningful demand. This pattern motivates two complementary strategies: prioritize associations among top items for broad impact, and explore long-tail niches for differentiation and targeted marketing.

The results indicate an extreme concentration of demand, surpassing the classic Pareto Principle (80/20 rule). Specifically, the top 20% of ranked products generate approximately 90.7% of the total transaction volume. This “Hyper-Pareto” distribution reveals that the platform’s core revenue is heavily dependent on a consolidated set of staples. Consequently, while the long tail offers variety, the immediate strategic priority for association mining lies within this top tier, as optimizing these high-velocity items will impact over 90% of user baskets.

# Rank-Frequency (long-tail): product_id count by rank
prod_freq <- data_sub[, .N, by = product_id][order(-N)]
prod_freq[, rank := .I]
ggplot(prod_freq, aes(x = rank, y = N)) +
  geom_line(linewidth = 0.3, color = "steelblue") +
  scale_x_log10() + scale_y_log10() +
  labs(title = "Long-tail: product Rank vs Frequency (log-log)",
       x = "Product rank", y = "Frequency (purchases)") +
  theme_minimal()

# Optional: cumulative share (e.g. top 20% of products → % of volume, 80/20 check)
prod_freq[, cum_N := cumsum(N)]
prod_freq[, cum_pct := cum_N / max(cum_N) * 100]
idx_20 <- max(1, round(0.2 * nrow(prod_freq)))
cat("Top 20% of products (by rank) account for",
    round(prod_freq[rank == idx_20]$cum_pct, 1), "% of volume (approx).\n")
## Top 20% of products (by rank) account for 90.7 % of volume (approx).

3.5 Sampling Representativeness

Because I only mine a fraction of orders, here I compare the sample’s Top 10 item distribution to the full dataset to verify whether the sampling preserves major purchasing patterns. A close alignment suggests that the sampling strategy is statistically representative, while large deviations may indicate the need for a higher sampling fraction or stratified sampling. This check guards against biased rules caused by an unbalanced sample.

The validation results demonstrate a good alignment between the sample and the full dataset. We observe a 100% overlap in the Top 10 items, with zero rank shifts—meaning every product from “Bag of Organic Bananas” (Rank 1) to “Organic Avocado” (Rank 10) maintained its exact position in both the sample and the whole population. This indicates that our sampling strategy has preserved the distributional characteristics of the original data, providing a highly reliable foundation for subsequent association rule mining.

# Sample Top 10 items (by product_id count)
sample_top10 <- data_sub[, .N, by = product_id][order(-N)][1:10]
setnames(sample_top10, "N", "sample_count")

# Merge product names and compare
full_top10 <- merge(full_top10, products[, .(product_id, product_name)], by = "product_id")
sample_top10 <- merge(sample_top10, products[, .(product_id, product_name)], by = "product_id")

# Merge into comparison table (aligned by sample rank)
compare_top10 <- merge(
  sample_top10[, .(product_id, product_name, sample_count, sample_rank = 1:.N)],
  full_top10[, .(product_id, full_count, full_rank = 1:.N)],
  by = "product_id", all.x = TRUE
)

knitr::kable(compare_top10, digits = 2, format = "html",
             col.names = c("Product ID", "Product Name", "Sample Count", "Sample Rank", 
                          "Full Count", "Full Rank")) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    font_size = 12
  )
Product ID Product Name Sample Count Sample Rank Full Count Full Rank
13176 Bag of Organic Bananas 113752 1 379450 1
16797 Strawberries 42971 2 142951 2
21137 Organic Strawberries 79758 3 264683 3
21903 Organic Baby Spinach 72143 4 241921 4
24852 Banana 142206 5 472565 5
26209 Limes 42106 6 140627 6
27845 Organic Whole Milk 41542 7 137905 7
47209 Organic Hass Avocado 64200 8 213584 8
47626 Large Lemon 45895 9 152657 9
47766 Organic Avocado 53073 10 176815 10
# Plot: sample vs full Top 10 frequency (normalized for comparison)
compare_top10[, sample_pct := sample_count / sum(sample_count) * 100]
compare_top10[, full_pct := full_count / sum(full_count, na.rm = TRUE) * 100]
# Long format for ggplot (keep rows with full data, or fill full with 0)
compare_top10[is.na(full_pct), full_pct := 0]
compare_long <- pivot_longer(compare_top10, cols = c(sample_pct, full_pct),
  names_to = "data_source", values_to = "pct")
compare_long$data_source <- ifelse(compare_long$data_source == "sample_pct", "Sample (order-based)", "Full data")

ggplot(compare_long, aes(x = reorder(product_name, -sample_rank), y = pct, fill = data_source)) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_manual(values = c("Full data" = "steelblue", "Sample (order-based)" = "coral")) +
  labs(title = "Sample vs Full: Top 10 Item Distribution", x = "Item", y = "Share (%)", fill = "Data") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

3.6 Redundant Rules and Apriori

I then run Apriori with tuned parameters tailored for the large-scale dataset (964,462 transactions). To prevent combinatorial explosion and memory overflow, I set a rigorous support threshold (supp=0.005, equivalent to ~4,800 absolute occurrences) and restricted maxlen=4. Additionally, the script calculated Kulczynski and Certainty Factor (CF) to assess robustness, removed redundant rules, and initially filtered for “Strong Rules” (Lift > 3, Confidence > 0.5).

3.6.1 Empirical Results & Interpretation:

The rigorous support threshold acted as a strong filter, distilling the noise down to 18 high-confidence association rules. While no rules met the strict simultaneous cutoff of Lift > 3 and Confidence > 0.5, an inspection of the top rules sorted by Lift reveals highly significant purchasing patterns:

  1. The “Culinary Complementarity” Pattern: The top-ranked rule, {Organic Cilantro} => {Limes}, exhibits the highest Lift of 5.79.

Interpretation: This strong positive correlation (Lift >> 1) reflects a specific culinary usage scenario (likely Mexican or Asian cuisine). Although the confidence is 0.25 (meaning only 25% of cilantro buyers also bought limes), the high Lift confirms that cilantro buyers are nearly 6 times more likely to buy limes than the average shopper. This is a classic example of a “complementary product” relationship suitable for bundle promotions.

  1. The “Banana Hub” Phenomenon: Rules 2 through 10 are entirely dominated by fruit-to-banana associations (e.g., {Organic Fuji Apple} => {Banana}).

Interpretation: These rules show moderate Lift (2.2 to 2.6) but higher Confidence (~30-38%). This confirms our finding that Bananas serve as the “default consequent” or “hub item” in the ecosystem. While these rules represent stable, high-frequency behavior (weekly fruit restocking), they are less about specific pairing intent and more about general grocery routine.

# Run Apriori with tuned params for large-scale data: higher supp, maxlen=4
rules_all <- apriori(trans, parameter = list(supp = 0.005, conf = 0.25, maxlen = 4, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##       4  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4822 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48452 item(s), 964462 transaction(s)] done [1.63s].
## sorting and recoding items ... [252 item(s)] done [0.04s].
## creating transaction tree ... done [0.35s].
## checking subsets of size 1 2 3 done [0.05s].
## writing ... [18 rule(s)] done [0.00s].
## creating S4 object  ... done [0.14s].
# Compute Kulczynski and Certainty Factor
quality(rules_all)$kulczynski <- interestMeasure(rules_all, measure = "kulczynski", transactions = trans)
quality(rules_all)$certainty <- interestMeasure(rules_all, measure = "certainty", transactions = trans)

# Redundant rule removal: rule counts before/after
n_before <- length(rules_all)
rules <- rules_all[!is.redundant(rules_all)]
n_after <- length(rules)
cat("Rules before pruning:", n_before, " | after:", n_after, "\n")
## Rules before pruning: 18  | after: 18
# Sort by Lift; strong-rule subset for display (Lift > 3 and Confidence > 0.5)
rules_sorted <- sort(rules, by = "lift", descending = TRUE)
rules_strong <- subset(rules_sorted, lift > 3 & confidence > 0.5)
cat("Strong rules (Lift>3, Confidence>0.5):", length(rules_strong), "\n")
## Strong rules (Lift>3, Confidence>0.5): 0
# Display top rules using inspect
inspect(head(rules_sorted, 10))
##      lhs                                       rhs                          support confidence   coverage     lift count kulczynski certainty
## [1]  {Organic Cilantro}                     => {Limes}                  0.005439302  0.2527340 0.02152184 5.789017  5246  0.1886622 0.2186210
## [2]  {Organic Large Extra Fancy Fuji Apple} => {Bag of Organic Bananas} 0.007272448  0.3089323 0.02354059 2.619325  7014  0.1852964 0.2165268
## [3]  {Organic Fuji Apple}                   => {Banana}                 0.010606950  0.3789591 0.02798970 2.570156 10230  0.2254485 0.2715524
## [4]  {Organic Raspberries}                  => {Bag of Organic Bananas} 0.012699308  0.2966264 0.04281247 2.514988 12248  0.2021496 0.2025753
## [5]  {Organic Hass Avocado}                 => {Bag of Organic Bananas} 0.019545612  0.2936293 0.06656561 2.489576 18851  0.2296747 0.1991775
## [6]  {Honeycrisp Apple}                     => {Banana}                 0.008993615  0.3606953 0.02493411 2.446288  8674  0.2108456 0.2501300
## [7]  {Apple Honeycrisp Organic}             => {Bag of Organic Bananas} 0.007302517  0.2789970 0.02617418 2.365514  7043  0.1704562 0.1825887
## [8]  {Organic Cucumber}                     => {Bag of Organic Bananas} 0.006616124  0.2674126 0.02474125 2.267295  6381  0.1617542 0.1694553
## [9]  {Cucumber Kirby}                       => {Banana}                 0.010162142  0.3334921 0.03047191 2.261793  9801  0.2012066 0.2182222
## [10] {Organic Gala Apples}                  => {Bag of Organic Bananas} 0.005886183  0.2592593 0.02270385 2.198165  5677  0.1545830 0.1602117
# Also create formatted table view
cat("\n=== Top 10 Rules by Lift (Table View) ===\n")
## 
## === Top 10 Rules by Lift (Table View) ===
top_rules_df <- as(head(rules_sorted, 10), "data.frame")
knitr::kable(top_rules_df, digits = 2, format = "html") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    font_size = 12
  ) %>%
  kableExtra::scroll_box(width = "100%", height = "300px")
rules support confidence coverage lift count kulczynski certainty
4 {Organic Cilantro} => {Limes} 0.01 0.25 0.02 5.79 5246 0.19 0.22
6 {Organic Large Extra Fancy Fuji Apple} => {Bag of Organic Bananas} 0.01 0.31 0.02 2.62 7014 0.19 0.22
9 {Organic Fuji Apple} => {Banana} 0.01 0.38 0.03 2.57 10230 0.23 0.27
15 {Organic Raspberries} => {Bag of Organic Bananas} 0.01 0.30 0.04 2.51 12248 0.20 0.20
18 {Organic Hass Avocado} => {Bag of Organic Bananas} 0.02 0.29 0.07 2.49 18851 0.23 0.20
7 {Honeycrisp Apple} => {Banana} 0.01 0.36 0.02 2.45 8674 0.21 0.25
10 {Apple Honeycrisp Organic} => {Bag of Organic Bananas} 0.01 0.28 0.03 2.37 7043 0.17 0.18
11 {Organic Cucumber} => {Bag of Organic Bananas} 0.01 0.27 0.02 2.27 6381 0.16 0.17
13 {Cucumber Kirby} => {Banana} 0.01 0.33 0.03 2.26 9801 0.20 0.22
3 {Organic Gala Apples} => {Bag of Organic Bananas} 0.01 0.26 0.02 2.20 5677 0.15 0.16
# Subset to positive correlation (Lift > 1) for quality analysis
rules_positive <- subset(rules_sorted, lift > 1)
cat("Number of positive correlation rules:", length(rules_positive), "\n")
## Number of positive correlation rules: 18

3.7 Parameter Sensitivity Analysis

Apriori is sensitive to minimum support and confidence, so I run a small grid search to observe how rule counts change under different thresholds. This provides justification for the chosen parameters by balancing two competing goals: avoiding rule explosion (too many low-quality rules) and retaining meaningful associations (too strict thresholds may discard valuable signals). The heatmap makes these trade-offs visible at a glance.

The parameter sensitivity analysis reveals a dramatic non-linear decay in rule quantity. At the lenient end (supp=0.001, conf=0.1), the algorithm generates an overwhelming 1,838 rules, likely filled with noise. Conversely, tightening the constraints to supp=0.005 and conf=0.3 nearly extinguishes all signals, yielding only 5 rules, which is too restrictive for analysis. The “sweet spot” appears around supp=0.005 with conf=0.1 (95 rules) or supp=0.002 with conf=0.2 (220 rules). These combinations offer a manageable volume of rules for manual inspection while ensuring sufficient statistical backing. For our final model, I prioritize higher confidence to ensure reliability, accepting a smaller but higher-quality set of rules.

# Grid search: rule count for varying supp / conf
support_levels <- c(0.001, 0.002, 0.005)
conf_levels <- c(0.1, 0.2, 0.3)
grid_params <- expand.grid(supp = support_levels, conf = conf_levels)
grid_params$n_rules <- vapply(seq_len(nrow(grid_params)), function(i) {
  r <- apriori(trans, parameter = list(
    supp = grid_params$supp[i], conf = grid_params$conf[i], target = "rules"
  ))
  length(r)
}, integer(1))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 964 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48452 item(s), 964462 transaction(s)] done [1.94s].
## sorting and recoding items ... [1773 item(s)] done [0.06s].
## creating transaction tree ... done [0.47s].
## checking subsets of size 1 2 3 4 done [0.17s].
## writing ... [1838 rule(s)] done [0.00s].
## creating S4 object  ... done [0.16s].
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1928 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48452 item(s), 964462 transaction(s)] done [1.67s].
## sorting and recoding items ... [833 item(s)] done [0.05s].
## creating transaction tree ... done [0.42s].
## checking subsets of size 1 2 3 4 done [0.12s].
## writing ... [603 rule(s)] done [0.00s].
## creating S4 object  ... done [0.15s].
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4822 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48452 item(s), 964462 transaction(s)] done [1.54s].
## sorting and recoding items ... [252 item(s)] done [0.04s].
## creating transaction tree ... done [0.37s].
## checking subsets of size 1 2 3 done [0.05s].
## writing ... [95 rule(s)] done [0.00s].
## creating S4 object  ... done [0.15s].
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 964 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48452 item(s), 964462 transaction(s)] done [1.75s].
## sorting and recoding items ... [1773 item(s)] done [0.06s].
## creating transaction tree ... done [0.40s].
## checking subsets of size 1 2 3 4 done [0.17s].
## writing ... [767 rule(s)] done [0.00s].
## creating S4 object  ... done [0.18s].
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1928 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48452 item(s), 964462 transaction(s)] done [1.60s].
## sorting and recoding items ... [833 item(s)] done [0.05s].
## creating transaction tree ... done [0.40s].
## checking subsets of size 1 2 3 4 done [0.12s].
## writing ... [220 rule(s)] done [0.00s].
## creating S4 object  ... done [0.15s].
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4822 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48452 item(s), 964462 transaction(s)] done [1.59s].
## sorting and recoding items ... [252 item(s)] done [0.04s].
## creating transaction tree ... done [0.36s].
## checking subsets of size 1 2 3 done [0.05s].
## writing ... [37 rule(s)] done [0.00s].
## creating S4 object  ... done [0.18s].
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 964 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48452 item(s), 964462 transaction(s)] done [1.76s].
## sorting and recoding items ... [1773 item(s)] done [0.06s].
## creating transaction tree ... done [0.40s].
## checking subsets of size 1 2 3 4 done [0.16s].
## writing ... [176 rule(s)] done [0.00s].
## creating S4 object  ... done [0.15s].
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1928 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48452 item(s), 964462 transaction(s)] done [1.90s].
## sorting and recoding items ... [833 item(s)] done [0.05s].
## creating transaction tree ... done [0.39s].
## checking subsets of size 1 2 3 4 done [0.12s].
## writing ... [44 rule(s)] done [0.00s].
## creating S4 object  ... done [0.16s].
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4822 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48452 item(s), 964462 transaction(s)] done [1.68s].
## sorting and recoding items ... [252 item(s)] done [0.04s].
## creating transaction tree ... done [0.35s].
## checking subsets of size 1 2 3 done [0.05s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object  ... done [0.16s].
knitr::kable(grid_params, digits = 2, format = "html",
             col.names = c("Support", "Confidence", "Number of Rules")) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    font_size = 12
  )
Support Confidence Number of Rules
0 0.1 1838
0 0.1 603
0 0.1 95
0 0.2 767
0 0.2 220
0 0.2 37
0 0.3 176
0 0.3 44
0 0.3 5
# Heatmap: rule count vs supp, conf
ggplot(grid_params, aes(x = factor(supp), y = factor(conf), fill = n_rules)) +
  geom_tile() +
  geom_text(aes(label = n_rules), color = "white", size = 4) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Parameter Sensitivity: Support × Confidence → Rule Count",
       x = "Minimum Support", y = "Minimum Confidence", fill = "Rules") +
  theme_minimal()

3.8 Multi-level (Department) Analysis

To elevate the analysis from tactical item pairing to strategic category management, I aggregated specific products into their parent Department dimension (e.g., mapping “Organic Banana” to “Produce”). At this macro scale, noise from individual product preferences is smoothed out, revealing stable, high-value cross-category associations critical for store layout and promotional planning.

3.8.1 Empirical Observations & Insights:

The Apriori algorithm (Supp=0.01, Conf=0.2) generated 12,038 valid rules. The top rules exhibit a distinct “Pantry Stock-up Pattern” characterized by complex antecedents:

  1. The “Dry Goods” Staple Effect: The top rules (Rules #1-4, #6-10) consistently identify {dry goods pasta} as the consequent for complex baskets containing 5-6 diverse categories (e.g., Bakery + Canned Goods + Dairy + Frozen + Snacks).

Interpretation: With confidence levels exceeding 50% (Conf > 0.5) and Lift > 2.8, this indicates that “Dry Goods/Pasta” acts as a universal “basket completer”. When a customer engages in a “major shopping trip” (buying across fresh, frozen, and pantry categories), pasta is the staple most likely to be added. It is not driven by a single item, but by the volume of the shop.

  1. The “Non-Food” Essentials Cluster: Rule #5 highlights a different mission: {beverages, pantry, personal care} => {household}.

Interpretation: This links personal consumption (care/beverages) with home maintenance (household), suggesting a “General Essentials” shopping mission distinct from fresh cooking.

The above strong predictability of {dry goods pasta} in large baskets suggests it should be positioned in high-traffic transition zones (e.g., between Dairy and Frozen aisles) rather than isolated, to capture these “stock-up” shoppers efficiently.

# Use data_dir from setup
departments_path <- file.path(data_dir, "departments.csv")

if (file.exists(departments_path) && "department_id" %in% names(products)) {
  departments <- fread(departments_path)
  # Order–department (deduplicate departments per order)
  order_dept <- merge(
    data_sub[, .(order_id, product_id)],
    products[, .(product_id, department_id)],
    by = "product_id"
  )
  order_dept <- merge(order_dept, departments, by = "department_id")
  order_dept <- unique(order_dept[, .(order_id, department)])

  trans_dept_list <- split(order_dept$department, order_dept$order_id)
  trans_dept <- as(trans_dept_list, "transactions")

  # Fewer departments; use higher support
  rules_dept <- apriori(trans_dept, parameter = list(supp = 0.01, conf = 0.2, target = "rules"))
  rules_dept <- rules_dept[!is.redundant(rules_dept)]
  rules_dept_sorted <- sort(rules_dept, by = "lift", decreasing = TRUE)
  cat("Department-level rules:", length(rules_dept_sorted), "\n")
  inspect(head(rules_dept_sorted, 10))
} else {
  cat("departments.csv not found or products lack department_id; skipping multi-level analysis.\n")
}
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9644 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[21 item(s), 964462 transaction(s)] done [0.13s].
## sorting and recoding items ... [21 item(s)] done [0.03s].
## creating transaction tree ... done [0.40s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.09s].
## writing ... [12836 rule(s)] done [0.00s].
## creating S4 object  ... done [0.08s].
## Department-level rules: 12038 
##      lhs                rhs                  support confidence   coverage     lift count
## [1]  {bakery,                                                                            
##       canned goods,                                                                      
##       dairy eggs,                                                                        
##       frozen,                                                                            
##       pantry,                                                                            
##       snacks}        => {dry goods pasta} 0.01034152  0.5335402 0.01938283 2.876657  9974
## [2]  {bakery,                                                                            
##       canned goods,                                                                      
##       frozen,                                                                            
##       pantry,                                                                            
##       produce,                                                                           
##       snacks}        => {dry goods pasta} 0.01026686  0.5321653 0.01929262 2.869244  9902
## [3]  {bakery,                                                                            
##       canned goods,                                                                      
##       frozen,                                                                            
##       pantry,                                                                            
##       snacks}        => {dry goods pasta} 0.01085268  0.5264031 0.02061668 2.838176 10467
## [4]  {beverages,                                                                         
##       canned goods,                                                                      
##       dairy eggs,                                                                        
##       frozen,                                                                            
##       pantry,                                                                            
##       produce,                                                                           
##       snacks}        => {dry goods pasta} 0.01068886  0.5198165 0.02056276 2.802663 10309
## [5]  {beverages,                                                                         
##       pantry,                                                                            
##       personal care} => {household}       0.01061939  0.4100737 0.02589630 2.795471 10242
## [6]  {canned goods,                                                                      
##       dairy eggs,                                                                        
##       frozen,                                                                            
##       meat seafood,                                                                      
##       snacks}        => {dry goods pasta} 0.01031767  0.5183622 0.01990436 2.794823  9951
## [7]  {beverages,                                                                         
##       canned goods,                                                                      
##       dairy eggs,                                                                        
##       frozen,                                                                            
##       pantry,                                                                            
##       snacks}        => {dry goods pasta} 0.01136385  0.5147957 0.02207448 2.775593 10960
## [8]  {canned goods,                                                                      
##       dairy eggs,                                                                        
##       deli,                                                                              
##       frozen,                                                                            
##       pantry,                                                                            
##       produce}       => {dry goods pasta} 0.01120728  0.5140534 0.02180179 2.771591 10809
## [9]  {bakery,                                                                            
##       canned goods,                                                                      
##       dairy eggs,                                                                        
##       frozen,                                                                            
##       pantry,                                                                            
##       produce}       => {dry goods pasta} 0.01292016  0.5129672 0.02518710 2.765735 12461
## [10] {canned goods,                                                                      
##       frozen,                                                                            
##       meat seafood,                                                                      
##       pantry,                                                                            
##       produce}       => {dry goods pasta} 0.01002320  0.5126750 0.01955080 2.764159  9667

4. FP-Growth Analysis

4.1 Performance Optimization: FP-Growth Algorithm

In this subsection I introduce the FP-Growth (Frequent Pattern Growth) algorithm as an alternative to Apriori, using the fast C backend via fim4r to mine rules on the same data and compare runtime.

# =========================================================
# High-Performance Mining with FP-Growth
# =========================================================

# Ensure packages


# Parameters (tune if needed)
fpg_supp <- 0.01
fpg_conf <- 0.25

cat("Running FP-Growth algorithm (C-implementation via fim4r)...\n")
## Running FP-Growth algorithm (C-implementation via fim4r)...
cat(sprintf("Parameters: Support = %.3f, Confidence = %.2f\n", fpg_supp, fpg_conf))
## Parameters: Support = 0.010, Confidence = 0.25
time_fpg <- system.time({
  fpg_rules <- arules::fim4r(
    trans,
    method = "fpgrowth",
    target = "rules",
    supp = fpg_supp,
    conf = fpg_conf
  )
})
## fim4r.fpgrowth 
## 
## Parameter specification:
##  supp conf target report
##     1   25  rules    scl
## 
## Data size: 964462 transactions and 48452 items 
## Result: 7 rules
# Optional: remove empty LHS rules
if (length(fpg_rules) > 0) {
  fpg_rules <- fpg_rules[size(lhs(fpg_rules)) > 0]
}

cat("\n[FP-Growth Results]\n")
## 
## [FP-Growth Results]
cat("Execution Time:", time_fpg["elapsed"], "seconds\n")
## Execution Time: 0.882 seconds
cat("Rules Generated:", length(fpg_rules), "\n")
## Rules Generated: 7
# =========================================================
# Performance Comparison: Apriori vs. FP-Growth
# =========================================================

# Assuming 'rules' is the object from Section 3.5 (Apriori after redundant pruning)
cat("--- Algorithm Comparison ---\n")
## --- Algorithm Comparison ---
# Create comparison table
algo_comparison <- data.frame(
  Algorithm = c("Apriori", "FP-Growth"),
  Rules_Found = c(length(rules), length(fpg_rules)),
  Execution_Time_sec = c(NA, time_fpg["elapsed"])
)

knitr::kable(algo_comparison, digits = 2, format = "html",
             col.names = c("Algorithm", "Rules Found", "Execution Time (seconds)")) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    font_size = 12
  )
Algorithm Rules Found Execution Time (seconds)
Apriori 18 NA
elapsed FP-Growth 7 0.88
# Note: Even if rule counts are identical, execution time differs.
# FP-Growth constructs a tree structure (FP-Tree) instead of generating candidates.

4.2 Rule Quality & Consistency Checks

I compare the rule quality and the overlap between Apriori and FP-Growth results.

From the quality summaries, I observe that FP-Growth shows slightly higher average confidence (Mean: 0.309 vs 0.293) but lower average and maximum lift compared with Apriori (Max Lift: 2.57 vs 5.79). This suggests that while FP-Growth identifies highly reliable rules (higher confidence), Apriori is more effective at capturing rare but high-strength associations that may be filtered out by the higher support threshold in FP-Growth.

For rule overlap, the intersection size is 7, and the exclusive rules are mainly in Apriori (11 rules in Apriori vs 0 in FP-Growth). This means FP-Growth confirms a subset of Apriori patterns, specifically those with higher support levels. The absence of exclusive rules in FP-Growth suggests it provides a more conservative, high-frequency set of rules, whereas Apriori’s broader range helps explain its practical value in discovering “hidden gems” or niche associations with lower support but higher lift.

# =========================================================
# Rule Quality Comparison (Lift/Confidence)
# =========================================================

# Safety checks
stopifnot(exists("rules"), exists("fpg_rules"))

# Extract quality metrics
apriori_q <- quality(rules)[, c("support", "confidence", "lift")]
fpg_q <- quality(fpg_rules)[, c("support", "confidence", "lift")]

# Summary statistics
cat("[Apriori Quality Summary]\n")
## [Apriori Quality Summary]
apriori_summary_df <- data.frame(
  Metric = c("Support", "Confidence", "Lift"),
  Min = c(min(apriori_q$support), min(apriori_q$confidence), min(apriori_q$lift)),
  Mean = c(mean(apriori_q$support), mean(apriori_q$confidence), mean(apriori_q$lift)),
  Max = c(max(apriori_q$support), max(apriori_q$confidence), max(apriori_q$lift))
)
formattable::formattable(apriori_summary_df)
Metric Min Mean Max
Support 0.0051044 0.009380302 0.01954561
Confidence 0.2527340 0.293323569 0.37895907
Lift 1.7420349 2.400580954 5.78901715
cat("\n[FP-Growth Quality Summary]\n")
## 
## [FP-Growth Quality Summary]
fpg_summary_df <- data.frame(
  Metric = c("Support", "Confidence", "Lift"),
  Min = c(min(fpg_q$support), min(fpg_q$confidence), min(fpg_q$lift)),
  Mean = c(mean(fpg_q$support), mean(fpg_q$confidence), mean(fpg_q$lift)),
  Max = c(max(fpg_q$support), max(fpg_q$confidence), max(fpg_q$lift))
)
formattable::formattable(fpg_summary_df)
Metric Min Mean Max
Support 0.01016214 0.01363188 0.01954561
Confidence 0.26787232 0.30920496 0.37895907
Lift 1.81674944 2.24012537 2.57015611
# Compare top-k by lift
k <- 10
cat(sprintf("\nTop-%d rules by lift (Apriori):\n", k))
## 
## Top-10 rules by lift (Apriori):
inspect(head(sort(rules, by = "lift"), k))
##      lhs                                       rhs                          support confidence   coverage     lift count kulczynski certainty
## [1]  {Organic Cilantro}                     => {Limes}                  0.005439302  0.2527340 0.02152184 5.789017  5246  0.1886622 0.2186210
## [2]  {Organic Large Extra Fancy Fuji Apple} => {Bag of Organic Bananas} 0.007272448  0.3089323 0.02354059 2.619325  7014  0.1852964 0.2165268
## [3]  {Organic Fuji Apple}                   => {Banana}                 0.010606950  0.3789591 0.02798970 2.570156 10230  0.2254485 0.2715524
## [4]  {Organic Raspberries}                  => {Bag of Organic Bananas} 0.012699308  0.2966264 0.04281247 2.514988 12248  0.2021496 0.2025753
## [5]  {Organic Hass Avocado}                 => {Bag of Organic Bananas} 0.019545612  0.2936293 0.06656561 2.489576 18851  0.2296747 0.1991775
## [6]  {Honeycrisp Apple}                     => {Banana}                 0.008993615  0.3606953 0.02493411 2.446288  8674  0.2108456 0.2501300
## [7]  {Apple Honeycrisp Organic}             => {Bag of Organic Bananas} 0.007302517  0.2789970 0.02617418 2.365514  7043  0.1704562 0.1825887
## [8]  {Organic Cucumber}                     => {Bag of Organic Bananas} 0.006616124  0.2674126 0.02474125 2.267295  6381  0.1617542 0.1694553
## [9]  {Cucumber Kirby}                       => {Banana}                 0.010162142  0.3334921 0.03047191 2.261793  9801  0.2012066 0.2182222
## [10] {Organic Gala Apples}                  => {Bag of Organic Bananas} 0.005886183  0.2592593 0.02270385 2.198165  5677  0.1545830 0.1602117
cat("\n=== Apriori Top Rules (Table View) ===\n")
## 
## === Apriori Top Rules (Table View) ===
apriori_top_df <- as(head(sort(rules, by = "lift"), k), "data.frame")
knitr::kable(apriori_top_df, digits = 2, format = "html") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    font_size = 12
  ) %>%
  kableExtra::scroll_box(width = "100%", height = "300px")
rules support confidence coverage lift count kulczynski certainty
4 {Organic Cilantro} => {Limes} 0.01 0.25 0.02 5.79 5246 0.19 0.22
6 {Organic Large Extra Fancy Fuji Apple} => {Bag of Organic Bananas} 0.01 0.31 0.02 2.62 7014 0.19 0.22
9 {Organic Fuji Apple} => {Banana} 0.01 0.38 0.03 2.57 10230 0.23 0.27
15 {Organic Raspberries} => {Bag of Organic Bananas} 0.01 0.30 0.04 2.51 12248 0.20 0.20
18 {Organic Hass Avocado} => {Bag of Organic Bananas} 0.02 0.29 0.07 2.49 18851 0.23 0.20
7 {Honeycrisp Apple} => {Banana} 0.01 0.36 0.02 2.45 8674 0.21 0.25
10 {Apple Honeycrisp Organic} => {Bag of Organic Bananas} 0.01 0.28 0.03 2.37 7043 0.17 0.18
11 {Organic Cucumber} => {Bag of Organic Bananas} 0.01 0.27 0.02 2.27 6381 0.16 0.17
13 {Cucumber Kirby} => {Banana} 0.01 0.33 0.03 2.26 9801 0.20 0.22
3 {Organic Gala Apples} => {Bag of Organic Bananas} 0.01 0.26 0.02 2.20 5677 0.15 0.16
cat(sprintf("\nTop-%d rules by lift (FP-Growth):\n", k))
## 
## Top-10 rules by lift (FP-Growth):
inspect(head(sort(fpg_rules, by = "lift"), k))
##     lhs                       rhs                      support    confidence
## [1] {Organic Fuji Apple}   => {Banana}                 0.01060695 0.3789591 
## [2] {Organic Raspberries}  => {Bag of Organic Bananas} 0.01269931 0.2966264 
## [3] {Organic Hass Avocado} => {Bag of Organic Bananas} 0.01954561 0.2936293 
## [4] {Cucumber Kirby}       => {Banana}                 0.01016214 0.3334921 
## [5] {Organic Avocado}      => {Banana}                 0.01682907 0.3058241 
## [6] {Strawberries}         => {Banana}                 0.01283306 0.2880315 
## [7] {Large Lemon}          => {Banana}                 0.01274700 0.2678723 
##     lift     count
## [1] 2.570156 10230
## [2] 2.514988 12248
## [3] 2.489576 18851
## [4] 2.261793  9801
## [5] 2.074144 16231
## [6] 1.953472 12377
## [7] 1.816749 12294
cat("\n=== FP-Growth Top Rules (Table View) ===\n")
## 
## === FP-Growth Top Rules (Table View) ===
fpg_top_df <- as(head(sort(fpg_rules, by = "lift"), k), "data.frame")
knitr::kable(fpg_top_df, digits = 2, format = "html") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    font_size = 12
  ) %>%
  kableExtra::scroll_box(width = "100%", height = "300px")
rules support confidence lift count
16 {Organic Fuji Apple} => {Banana} 0.01 0.38 2.57 10230
7 {Organic Raspberries} => {Bag of Organic Bananas} 0.01 0.30 2.51 12248
2 {Organic Hass Avocado} => {Bag of Organic Bananas} 0.02 0.29 2.49 18851
12 {Cucumber Kirby} => {Banana} 0.01 0.33 2.26 9801
5 {Organic Avocado} => {Banana} 0.02 0.31 2.07 16231
11 {Strawberries} => {Banana} 0.01 0.29 1.95 12377
6 {Large Lemon} => {Banana} 0.01 0.27 1.82 12294
# =========================================================
#  Rule Overlap (Apriori vs FP-Growth)
# =========================================================

# Convert rules to comparable string form
apriori_rules_chr <- as(rules, "data.frame")$rules
fpg_rules_chr <- as(fpg_rules, "data.frame")$rules

# Overlap stats
overlap <- intersect(apriori_rules_chr, fpg_rules_chr)
only_apriori <- setdiff(apriori_rules_chr, fpg_rules_chr)
only_fpg <- setdiff(fpg_rules_chr, apriori_rules_chr)

cat("[Rule Overlap Summary]
")
## [Rule Overlap Summary]
cat("Overlap rules:", length(overlap), "\n")
## Overlap rules: 7
cat("Only Apriori rules:", length(only_apriori), "\n")
## Only Apriori rules: 11
cat("Only FP-Growth rules:", length(only_fpg), "\n")
## Only FP-Growth rules: 0
# Optional: show a few exclusive rules
cat("\nExamples only in Apriori (up to 5):\n")
## 
## Examples only in Apriori (up to 5):
if (length(only_apriori) > 0) {
  formattable::formattable(data.frame(Rules = head(only_apriori, 5)))
}
Rules
{Blueberries} => {Banana}
{Original Hummus} => {Banana}
{Organic Gala Apples} => {Bag of Organic Bananas}
{Organic Cilantro} => {Limes}
{Yellow Onions} => {Banana}
cat("\nExamples only in FP-Growth (up to 5):\n")
## 
## Examples only in FP-Growth (up to 5):
if (length(only_fpg) > 0) {
  formattable::formattable(data.frame(Rules = head(only_fpg, 5)))
}

4.3 Analysis of Algorithm Efficiency

In the previous section, I introduced FP-Growth to solve the speed problem of large transaction data.

Why FP-Growth? Apriori uses a “generate-and-test” process and needs many database scans. FP-Growth uses a divide-and-conquer idea. It compresses data into an FP-Tree and mines frequent itemsets from the tree, so it can run faster, especially at low support.

Observations:

  1. Speed: The fim4r (C backend) runs much faster than standard Apriori.
  2. Result Consistency: FP-Growth finds high-lift rules (e.g., organic produce + fruit) that are similar to Apriori, so the results are reliable.
  3. Scalability: For very large datasets, FP-Growth is more suitable because it avoids generating too many candidates.

5. Additional Experiments

In this section, I implement four additional optimization experiments:

  1. sequential pattern mining

  2. multi-level association rules

  3. statistical significance tests for rules

  4. added value / improvement filtering.

These four additions have clear purposes: sequential mining captures time order so I can study repeat-purchase chains, multi-level rules reduce sparsity by moving from products to aisles/departments, statistical tests filter out rules that might appear by chance, and added-value/improvement checks remove rules that do not add predictive power beyond item popularity. In short, they help turn raw association rules into time-aware, level-aware, and statistically reliable insights. I also add small limits when needed to control runtime.

5.1 Sequential Pattern Mining (arulesSequences + cSPADE)

In this section, I aim to use the purchase history to find sequential patterns (such as “buy A, then later buy B”), which cannot be captured by standard same-basket rules. I use the orders.csv file combined with the sampled product data, setting user_id as the sequence ID and order_number as the event ID. To make the sequence clearer and the processing faster, I reduce each order to a single representative item before running the cspade() algorithm.

The results show that the algorithm processed 44,398 events and identified 895 sequential patterns. However, the rule induction step only produced 4 rules, and all of them followed an A → A pattern, such as {0% Greek Strained Yogurt} => {0% Greek Strained Yogurt} and {Sparkling Water} => {Sparkling Water}. This indicates that the current model successfully captures strong repurchase behavior, but it does not yet reveal cross-product sequences. This limitation is likely because I restricted the event construction to keep only one item per order.

# Sequential Pattern Mining (cSPADE)
if (!requireNamespace("arulesSequences", quietly = TRUE)) {
  cat("Package arulesSequences not found. Please install it first.
")
} else {
  library(arulesSequences)

  orders_path <- file.path(data_dir, "orders.csv")
  if (file.exists(orders_path)) {
    orders <- fread(orders_path, select = c("order_id", "user_id", "order_number"))

    # Merge order info into sampled data
    seq_data <- merge(data_sub[, .(order_id, product_name)], orders, by = "order_id")

    # Speed-first controls (but keep enough events for sequential rules)
    set.seed(2026)
    max_seq_orders <- 30000
    max_orders_per_user <- 12

    seq_order_ids <- unique(seq_data$order_id)
    if (length(seq_order_ids) > max_seq_orders) {
      seq_order_ids <- sample(seq_order_ids, max_seq_orders)
      seq_data <- seq_data[order_id %in% seq_order_ids]
    }

    # Sort and keep only the first N orders per user
    seq_data <- seq_data[order(user_id, order_number)]
    seq_data <- seq_data[, .SD[order(order_number)][1:max_orders_per_user], by = user_id]

    # Create per-user event index (1..k) to ensure valid eventID order
    seq_data[, event_index := data.table::frank(order_number, ties.method = "dense"), by = user_id]

    # Build event id (user + event_index)
    seq_data[, event_id := paste(user_id, event_index, sep = "_")]

    # Reduce each event to a single representative item (most frequent in dataset)
    item_freq_all <- data_sub[, .N, by = product_name]
    setorder(item_freq_all, -N)
    freq_rank <- setNames(seq_len(nrow(item_freq_all)), item_freq_all$product_name)
    seq_data[, item_rank := freq_rank[product_name]]
    seq_data <- seq_data[order(event_id, item_rank)]
    seq_data <- seq_data[, .SD[1], by = event_id]

    # Split items by event (order)
    trans_list <- split(seq_data$product_name, seq_data$event_id)
    trans_seq <- as(trans_list, "transactions")

    # Map event_id -> (user_id, event_index) and enforce integer IDs
    event_map <- unique(seq_data[, .(event_id, user_id, event_index)])
    event_ids <- names(trans_list)
    event_map <- event_map[match(event_ids, event_map$event_id)]
    transactionInfo(trans_seq)$sequenceID <- as.integer(factor(event_map$user_id))
    transactionInfo(trans_seq)$eventID <- as.integer(event_map$event_index)

    # Ensure transactions are ordered by (sequenceID, eventID) for cSPADE
    ord <- order(transactionInfo(trans_seq)$sequenceID, transactionInfo(trans_seq)$eventID)
    trans_seq <- trans_seq[ord]

    cat("Sequential transactions created:", length(trans_seq), "events.
")

    # Run cSPADE on sequences (single-item events => clearer sequences)
    seq_patterns <- cspade(
      trans_seq,
      parameter = list(support = 0.0001, maxlen = 4, maxsize = 1),
      control = list(verbose = FALSE)
    )

    cat("Sequential patterns found:", length(seq_patterns), "\n")
    if (length(seq_patterns) > 0) {
      inspect(head(sort(seq_patterns, by = "support", decreasing = TRUE), 10))
      
      # Add formatted table view
      cat("\n=== Top 10 Sequential Patterns (Table View) ===\n")
      seq_pat_df <- as(head(sort(seq_patterns, by = "support", decreasing = TRUE), 10), "data.frame")
      formattable::formattable(seq_pat_df)
    }

    # Use full pattern set for rule induction, then filter rules if needed
    seq_rules_all <- tryCatch(ruleInduction(seq_patterns, confidence = 0.05), error = function(e) NULL)

    if (!is.null(seq_rules_all)) {
      # Optional: filter rules by length or confidence after induction
      seq_rules <- seq_rules_all[size(lhs(seq_rules_all)) >= 1]
      cat("Sequential rules found:", length(seq_rules), "\n")
      if (length(seq_rules) > 0) {
        inspect(head(sort(seq_rules, by = "confidence", decreasing = TRUE), 10))
        
        # Add formatted table view
        cat("\n=== Top 10 Sequential Rules (Table View) ===\n")
        seq_rules_df <- as(head(sort(seq_rules, by = "confidence", decreasing = TRUE), 10), "data.frame")
        formattable::formattable(seq_rules_df)
      }
    } else {
      cat("ruleInduction failed; consider lowering support or confidence.
")
    }
  } else {
    cat("orders.csv not found; skipping sequential pattern mining.
")
  }
}
## Sequential transactions created: 44398 events.
## Sequential patterns found: 895 
##     items                         support 
##   1 <{Banana}>                 0.13779452 
##   2 <{Bag of Organic Bananas}> 0.12057501 
##   3 <{Organic Strawberries}>   0.04769958 
##   4 <{Organic Baby Spinach}>   0.03774455 
##   5 <{Strawberries}>           0.02482992 
##   6 <{Organic Hass Avocado}>   0.01760387 
##   7 <{Limes}>                  0.01691202 
##   8 <{Organic Avocado}>        0.01499020 
##   9 <{Organic Whole Milk}>     0.01425991 
##  10 <{Large Lemon}>            0.01249183 
##  
## 
## === Top 10 Sequential Patterns (Table View) ===
## Sequential rules found: 4 
##    lhs                             rhs                               support confidence      lift 
##  1 <{0% Greek Strained Yogurt}> => <{0% Greek Strained Yogurt}> 0.0001537456 0.13793103 123.74316 
##  2 <{Sparkling Water}>          => <{Sparkling Water}>          0.0001153092 0.10714286  99.55485 
##  3 <{Sparkling Mineral Water}>  => <{Sparkling Mineral Water}>  0.0001153092 0.06521739  36.88611 
##  4 <{Reduced Fat 2% Milk}>      => <{Reduced Fat 2% Milk}>      0.0001153092 0.06000000  31.22040 
##  
## 
## === Top 10 Sequential Rules (Table View) ===
rule support confidence lift
62 <{0% Greek Strained Yogurt}> => <{0% Greek Strained Yogurt}> 0.0001537456 0.13793103 123.74316
7 <{Sparkling Water}> => <{Sparkling Water}> 0.0001153092 0.10714286 99.55485
8 <{Sparkling Mineral Water}> => <{Sparkling Mineral Water}> 0.0001153092 0.06521739 36.88611
15 <{Reduced Fat 2% Milk}> => <{Reduced Fat 2% Milk}> 0.0001153092 0.06000000 31.22040

The bar chart below displays the most frequent repeat-purchase sequences identified in the dataset. The x-axis represents the Support level, which indicates the proportion of transactions or users that contain a specific sequential pattern, while the y-axis lists the top 10 individual items or sequences.

According to the visualization, everyday grocery staples dominate the top patterns. Bananas and Bag of Organic Bananas exhibit the highest support, significantly outperforming other items, which reflects their role as high-frequency “anchor” products in consumer shopping habits. Other organic items, such as Organic Strawberries, Organic Baby Spinach, and Organic Hass Avocado, also rank prominently. This distribution suggests that the most common sequential behaviors are centered around fresh produce and organic essentials. By identifying these high-support patterns, we can better understand the core products that drive recurring traffic and use these insights to optimize inventory placement or loyalty promotions.

# Visualization: Top sequential patterns by support
if (exists("seq_patterns") && length(seq_patterns) > 0) {
  df_seq <- as(seq_patterns, "data.frame")

  # Normalize column names to items/support if needed
  if (!"items" %in% names(df_seq)) {
    if ("sequence" %in% names(df_seq)) df_seq$items <- df_seq$sequence
    if ("pattern" %in% names(df_seq)) df_seq$items <- df_seq$pattern
  }

  if (!"support" %in% names(df_seq)) {
    cat("seq_patterns data frame missing support column; skip 5.1 plot.
")
  } else if (!"items" %in% names(df_seq)) {
    cat("seq_patterns data frame missing items/sequence column; skip 5.1 plot.
")
  } else {
    df_seq$items <- as.character(df_seq$items)
    df_seq <- df_seq[order(-df_seq$support), ]
    top_seq <- head(df_seq, 10)
    if (nrow(top_seq) == 0) {
      cat("No patterns to plot in 5.1.
")
    } else {
      top_seq$items <- factor(top_seq$items, levels = rev(unique(top_seq$items)))
      ggplot(top_seq, aes(x = items, y = support)) +
        geom_col(fill = "steelblue") +
        coord_flip() +
        labs(title = "Top 10 Sequential Patterns (Support)", x = "Pattern", y = "Support") +
        theme_minimal()
    }
  }
} else {
  cat("seq_patterns not found; skip 5.1 plot.
")
}

5.2 Multi-level Association Rules (Product / Aisle / Department)

In this section, I aimed to expand the analysis beyond simple product associations by incorporating aisle and department information. Using the aisles.csv and departments.csv files, I first examined aisle-level transactions to improve rule stability. Then, I built a mixed-level dataset using tags— P: for product, A: for aisle, and D: for department—to capture interactions across different hierarchy levels.

The results show distinct patterns at each level. The aisle-level mining generated 5,834 rules and successfully identified complementary meal preparation categories. A strong example is the relationship between {pasta sauce} and {dry pasta}, which had a Lift of 4.42. For the mixed-level analysis, the algorithm produced 85,756 rules. While the highest-ranked rules mostly reflected obvious structural hierarchies (e.g., an ‘Other’ aisle implies the ‘Other’ department), meaningful cross-level behaviors also emerged. Specifically, combinations like {A:fresh vegetables, A:packaged poultry} proved to be strong predictors for purchasing the specific product {P:Boneless Skinless Chicken Breasts}, achieving a high Lift of 31.86.

# Multi-level Association Rules: Aisle + Mixed Level

aisles_path <- file.path(data_dir, "aisles.csv")
departments_path <- file.path(data_dir, "departments.csv")

if (file.exists(aisles_path) && file.exists(departments_path)) {
  aisles <- fread(aisles_path)
  departments <- fread(departments_path)

  # Aisle-level transactions
  order_aisle <- merge(
    data_sub[, .(order_id, product_id)],
    products[, .(product_id, aisle_id)],
    by = "product_id"
  )
  order_aisle <- merge(order_aisle, aisles, by = "aisle_id")
  order_aisle <- unique(order_aisle[, .(order_id, aisle)])

  trans_aisle_list <- split(order_aisle$aisle, order_aisle$order_id)
  trans_aisle <- as(trans_aisle_list, "transactions")

  rules_aisle <- apriori(trans_aisle, parameter = list(supp = 0.01, conf = 0.2, target = "rules"))
  rules_aisle <- rules_aisle[!is.redundant(rules_aisle)]
  rules_aisle_sorted <- sort(rules_aisle, by = "lift", decreasing = TRUE)

  cat("Aisle-level rules:", length(rules_aisle_sorted), "\n")
  if (length(rules_aisle_sorted) > 0) {
    inspect(head(rules_aisle_sorted, 10))
    
    # Add formatted table view
    cat("\n=== Top 10 Aisle-level Rules (Table View) ===\n")
    aisle_rules_df <- as(head(rules_aisle_sorted, 10), "data.frame")
    knitr::kable(aisle_rules_df, digits = 2, format = "html") %>%
      kableExtra::kable_styling(
        bootstrap_options = c("striped", "hover", "condensed"),
        full_width = FALSE,
        font_size = 11
      ) %>%
      kableExtra::scroll_box(width = "100%", height = "300px")
  }

  # Mixed-level transactions (Product + Aisle + Department)
  data_lvl <- merge(data_sub, products[, .(product_id, aisle_id, department_id)], by = "product_id")
  data_lvl <- merge(data_lvl, aisles, by = "aisle_id")
  data_lvl <- merge(data_lvl, departments, by = "department_id")

  data_lvl[, item_product := paste0("P:", product_name)]
  data_lvl[, item_aisle := paste0("A:", aisle)]
  data_lvl[, item_dept := paste0("D:", department)]

  # Build mixed item list per order (unique items)
  mixed_list <- data_lvl[, .(items = unique(c(item_product, item_aisle, item_dept))), by = order_id]
  trans_mixed <- as(split(mixed_list$items, mixed_list$order_id), "transactions")

  rules_mixed <- apriori(trans_mixed, parameter = list(supp = 0.01, conf = 0.2, target = "rules"))
  rules_mixed <- rules_mixed[!is.redundant(rules_mixed)]
  rules_mixed_sorted <- sort(rules_mixed, by = "lift", decreasing = TRUE)

  cat("Mixed-level rules:", length(rules_mixed_sorted), "\n")
  if (length(rules_mixed_sorted) > 0) {
    inspect(head(rules_mixed_sorted, 10))
    
    # Add formatted table view
    cat("\n=== Top 10 Mixed-level Rules (Table View) ===\n")
    mixed_rules_df <- as(head(rules_mixed_sorted, 10), "data.frame")
    knitr::kable(mixed_rules_df, digits = 2, format = "html") %>%
      kableExtra::kable_styling(
        bootstrap_options = c("striped", "hover", "condensed"),
        full_width = FALSE,
        font_size = 11
      ) %>%
      kableExtra::scroll_box(width = "100%", height = "300px")
  }
} else {
  cat("aisles.csv / departments.csv not found; skipping multi-level rules.
")
}
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9644 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[134 item(s), 964462 transaction(s)] done [0.24s].
## sorting and recoding items ... [96 item(s)] done [0.05s].
## creating transaction tree ... done [0.39s].
## checking subsets of size 1 2 3 4 5 6 done [0.52s].
## writing ... [6094 rule(s)] done [0.00s].
## creating S4 object  ... done [0.17s].
## Aisle-level rules: 5834 
##      lhs                              rhs                           support confidence   coverage     lift count
## [1]  {dry pasta,                                                                                                
##       packaged vegetables fruits}  => {pasta sauce}              0.01032804  0.2840319 0.03636224 4.601448  9961
## [2]  {packaged vegetables fruits,                                                                               
##       pasta sauce}                 => {dry pasta}                0.01032804  0.3200218 0.03227291 4.536753  9961
## [3]  {fresh vegetables,                                                                                         
##       pasta sauce}                 => {dry pasta}                0.01277603  0.3174055 0.04025146 4.499663 12322
## [4]  {fresh fruits,                                                                                             
##       pasta sauce}                 => {dry pasta}                0.01332142  0.3161028 0.04214267 4.481196 12848
## [5]  {dry pasta,                                                                                                
##       fresh fruits}                => {pasta sauce}              0.01332142  0.2761348 0.04824244 4.473511 12848
## [6]  {pasta sauce}                 => {dry pasta}                0.01923456  0.3116087 0.06172664 4.417485 18551
## [7]  {dry pasta}                   => {pasta sauce}              0.01923456  0.2726765 0.07053985 4.417485 18551
## [8]  {canned jarred vegetables,                                                                                 
##       fresh fruits,                                                                                             
##       fresh vegetables}            => {canned meals beans}       0.01188227  0.2855150 0.04161698 4.097987 11460
## [9]  {canned meals beans,                                                                                       
##       fresh vegetables}            => {canned jarred vegetables} 0.01513901  0.3032084 0.04992939 4.097249 14601
## [10] {canned jarred vegetables,                                                                                 
##       fresh fruits}                => {canned meals beans}       0.01361484  0.2725122 0.04996050 3.911359 13131
## 
## === Top 10 Aisle-level Rules (Table View) ===
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9644 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48607 item(s), 964462 transaction(s)] done [2.36s].
## sorting and recoding items ... [220 item(s)] done [0.14s].
## creating transaction tree ... done [0.50s].
## checking subsets of size 1 2 3 4 5
##  done [7.39s].
## writing ... [376112 rule(s)] done [0.02s].
## creating S4 object  ... done [0.27s].
## Mixed-level rules: 158444 
##      lhs                      rhs                                      support confidence   coverage     lift count
## [1]  {A:other}             => {D:other}                             0.01089312  1.0000000 0.01089312 91.80107 10506
## [2]  {D:other}             => {A:other}                             0.01089312  1.0000000 0.01089312 91.80107 10506
## [3]  {A:missing}           => {D:missing}                           0.01846625  1.0000000 0.01846625 54.15284 17810
## [4]  {D:missing}           => {A:missing}                           0.01846625  1.0000000 0.01846625 54.15284 17810
## [5]  {A:cat food care}     => {D:pets}                              0.01074174  1.0000000 0.01074174 53.85047 10360
## [6]  {D:pets}              => {A:cat food care}                     0.01074174  0.5784478 0.01856994 53.85047 10360
## [7]  {A:fresh vegetables,                                                                                          
##       A:packaged poultry}  => {P:Boneless Skinless Chicken Breasts} 0.01201499  0.4993536 0.02406108 31.86289 11588
## [8]  {A:fresh fruits,                                                                                              
##       A:packaged poultry}  => {P:Boneless Skinless Chicken Breasts} 0.01167076  0.4849216 0.02406730 30.94201 11256
## [9]  {A:packaged poultry,                                                                                          
##       D:produce}           => {P:Boneless Skinless Chicken Breasts} 0.01465273  0.4602058 0.03183951 29.36494 14132
## [10] {A:packaged poultry,                                                                                          
##       D:dairy eggs}        => {P:Boneless Skinless Chicken Breasts} 0.01265161  0.4402035 0.02874038 28.08862 12202
## 
## === Top 10 Mixed-level Rules (Table View) ===
rules support confidence coverage lift count
44 {A:other} => {D:other} 0.01 1.00 0.01 91.80 10506
45 {D:other} => {A:other} 0.01 1.00 0.01 91.80 10506
257 {A:missing} => {D:missing} 0.02 1.00 0.02 54.15 17810
258 {D:missing} => {A:missing} 0.02 1.00 0.02 54.15 17810
29 {A:cat food care} => {D:pets} 0.01 1.00 0.01 53.85 10360
30 {D:pets} => {A:cat food care} 0.01 0.58 0.02 53.85 10360
2873 {A:fresh vegetables,A:packaged poultry} => {P:Boneless Skinless Chicken Breasts} 0.01 0.50 0.02 31.86 11588
2876 {A:fresh fruits,A:packaged poultry} => {P:Boneless Skinless Chicken Breasts} 0.01 0.48 0.02 30.94 11256
2882 {A:packaged poultry,D:produce} => {P:Boneless Skinless Chicken Breasts} 0.01 0.46 0.03 29.36 14132
2879 {A:packaged poultry,D:dairy eggs} => {P:Boneless Skinless Chicken Breasts} 0.01 0.44 0.03 28.09 12202

Figure below displays the top 10 aisle-level association rules ranked by Lift, visualizing the strongest cross-category relationships in the dataset. The chart reveals two distinct consumption patterns with high Lift values (ranging from approximately 4.0 to 4.6). The most prominent cluster centers on meal preparation, strongly linking ‘dry pasta’ and ‘pasta sauce’ with fresh or packaged vegetables; this effectively captures the purchasing behavior for pasta-based dinners. A secondary cluster appears at the bottom, connecting ‘canned jarred vegetables’ with ‘canned meals beans,’ which reflects a consistent tendency among users to stock up on non-perishable pantry staples in the same trip.

#  Visualization: Top aisle-level rules by lift
if (exists("rules_aisle_sorted") && length(rules_aisle_sorted) > 0) {
  df_aisle <- as(rules_aisle_sorted, "data.frame")
  df_aisle <- df_aisle[order(-df_aisle$lift), ]
  top_aisle <- head(df_aisle, 10)
  ggplot(top_aisle, aes(x = reorder(rules, lift), y = lift)) +
    geom_col(fill = "darkgreen") +
    coord_flip() +
    labs(title = "Top 10 Aisle-level Rules (Lift)", x = "Rule", y = "Lift") +
    theme_minimal()
} else {
  cat("rules_aisle_sorted not found; skip 5.2 plot.
")
}

5.3 Statistical Significance Tests (Fisher / Chi-square)

The primary goal of this analysis is to verify whether high-lift association rules remain statistically significant, particularly for patterns with lower support. To achieve this, I select the top rules based on their Lift values and construct \(2 \times 2\) contingency tables for each. I then perform both Fisher’s Exact Test and the Chi-square test to calculate their respective p-values.

The results indicate that all 18 analyzed rules are statistically significant, with p-values for both tests falling well below the 0.05 threshold. In the output table, many p-values appear as 0, which is likely due to numerical underflow caused by the extremely strong associations in a large sample size. For instance, the rule {Organic Cilantro} => {Limes} shows a high Lift of 5.79 and is highly significant. However, while these p-values confirm that the observed associations are not due to random chance, they do not necessarily reflect the practical importance of each rule. Therefore, we should still rely on metrics like Lift and Added Value for prioritizing the rules in a real-world context, rather than using p-values as the sole ranking tool.

# Statistical Significance Tests (Fisher / Chi-square)

# Use top rules by lift for testing
if (exists("rules_sorted") && length(rules_sorted) > 0) {
  top_n <- min(50, length(rules_sorted))
  rules_test <- head(rules_sorted, top_n)

  N_total <- length(trans)
  supp_ab <- quality(rules_test)$support
  supp_a <- support(lhs(rules_test), trans)
  supp_b <- support(rhs(rules_test), trans)

  n11 <- round(supp_ab * N_total)
  n10 <- round((supp_a - supp_ab) * N_total)
  n01 <- round((supp_b - supp_ab) * N_total)
  n00 <- pmax(0, N_total - n11 - n10 - n01)

  # Guard against negative counts due to rounding
  n10 <- pmax(0, n10)
  n01 <- pmax(0, n01)

  p_fisher <- numeric(length(rules_test))
  p_chisq <- numeric(length(rules_test))

  for (i in seq_along(rules_test)) {
    mat <- matrix(c(n11[i], n10[i], n01[i], n00[i]), nrow = 2, byrow = TRUE)
    p_fisher[i] <- tryCatch(fisher.test(mat, alternative = "greater")$p.value, error = function(e) NA)
    p_chisq[i] <- tryCatch(chisq.test(mat, correct = FALSE)$p.value, error = function(e) NA)
  }

  stats_df <- data.frame(
    rule = labels(rules_test),
    support = supp_ab,
    confidence = quality(rules_test)$confidence,
    lift = quality(rules_test)$lift,
    p_fisher = p_fisher,
    p_chisq = p_chisq
  )

  cat("=== Top rules with p-values (sorted by Fisher) ===\n")
  stats_display <- head(stats_df[order(stats_df$p_fisher), ], 10)
  knitr::kable(stats_display, digits = 2, format = "html",
               col.names = c("Rule", "Support", "Confidence", "Lift", "p-value (Fisher)", "p-value (Chi-square)")) %>%
    kableExtra::kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      full_width = FALSE,
      font_size = 11
    ) %>%
    kableExtra::scroll_box(width = "100%", height = "300px")
  cat("Significant by Fisher (p < 0.05):", sum(stats_df$p_fisher < 0.05, na.rm = TRUE), "/", nrow(stats_df), "\n")
} else {
  cat("rules_sorted not found; skipping statistical tests.\n")
}
## === Top rules with p-values (sorted by Fisher) ===
## Significant by Fisher (p < 0.05): 18 / 18

5.4 Added Value and Improvement Filtering

This analysis aims to identify association rules that provide genuine predictive value rather than just reflecting item popularity or redundant information. To achieve this, I evaluate the rules using two key metrics: Added Value (the difference between confidence and RHS support) and Improvement (which compares a multi-item rule to its best single-item component).

Based on the experimental results, the Apriori algorithm generated 1,243 rules initially, and after sorting by lift and limiting the set to 200 candidates, all 200 rules passed the filters (Added Value > 0 and Improvement > 0). The top-ranked rules are dominated by strong within-brand or within-category bundles, particularly Greek yogurt combinations and sparkling water trios. Since the current filter did not reduce the candidate list, the thresholds appear to be relatively lenient. For a more refined selection in future analyses, one could apply stricter cutoffs, such as requiring an Added Value greater than 0.10 or an Improvement above 0.05, to focus on the most impactful associations.

# 5.4 Added Value / Improvement Filtering

if (exists("trans") && length(trans) > 0) {
  # Local rule set for 5.4 only (do NOT affect main analysis)
  rules_imp_source <- apriori(
    trans,
    parameter = list(supp = 0.001, conf = 0.15, maxlen = 5, minlen = 2, target = "rules")
  )
  rules_imp_source <- rules_imp_source[!is.redundant(rules_imp_source)]
  rules_sorted_local <- sort(rules_imp_source, by = "lift", decreasing = TRUE)

  if (length(rules_sorted_local) == 0) {
    cat("No rules found for 5.4 local analysis.\n")
  } else {
    # Prefer multi-antecedent rules so improvement is meaningful
    rules_multi <- rules_sorted_local[size(lhs(rules_sorted_local)) >= 2]

    # If too few multi-antecedent rules, fall back to all rules
    if (length(rules_multi) >= 10) {
      rules_imp <- rules_multi
    } else {
      rules_imp <- rules_sorted_local
    }

    top_n <- min(200, length(rules_imp))
    rules_imp <- head(rules_imp, top_n)

    # Added Value = confidence - support(RHS)
    supp_rhs <- support(rhs(rules_imp), trans)
    added_value <- quality(rules_imp)$confidence - supp_rhs

    # Build lookup for single-item rules with same RHS (baseline for improvement)
    rules_single <- rules_sorted_local[size(lhs(rules_sorted_local)) == 1 & size(rhs(rules_sorted_local)) == 1]
    if (length(rules_single) > 0) {
      single_df <- as(rules_single, "data.frame")
      single_df$lhs_item <- gsub("[{}]", "", sub(" =>.*", "", single_df$rules))
      single_df$rhs_item <- gsub("[{}]", "", sub(".*=> ", "", single_df$rules))
      single_conf_map <- split(single_df$confidence, paste(single_df$lhs_item, single_df$rhs_item, sep = "|"))
    } else {
      single_conf_map <- list()
    }

    # Improvement: compare to best single-item LHS -> RHS (only for single RHS)
    lhs_list <- as(lhs(rules_imp), "list")
    rhs_list <- as(rhs(rules_imp), "list")
    improvement <- rep(NA_real_, length(rules_imp))

    for (i in seq_along(rules_imp)) {
      lhs_items <- lhs_list[[i]]
      rhs_items <- rhs_list[[i]]
      if (length(lhs_items) <= 1 || length(rhs_items) != 1) {
        improvement[i] <- NA
        next
      }

      rhs_item <- rhs_items[[1]]
      single_confs <- sapply(lhs_items, function(it) {
        key <- paste(it, rhs_item, sep = "|")
        vals <- single_conf_map[[key]]
        if (is.null(vals)) return(NA_real_)
        max(vals, na.rm = TRUE)
      })

      if (all(is.na(single_confs))) {
        improvement[i] <- NA
      } else {
        best_single <- max(single_confs, na.rm = TRUE)
        improvement[i] <- quality(rules_imp)$confidence[i] - best_single
      }
    }

    imp_df <- data.frame(
      rule = labels(rules_imp),
      confidence = quality(rules_imp)$confidence,
      lift = quality(rules_imp)$lift,
      added_value = added_value,
      improvement = improvement
    )

    # Filter: added_value > 0 and (if available) improvement > 0
    imp_filtered <- imp_df[imp_df$added_value > 0 & (is.na(imp_df$improvement) | imp_df$improvement > 0), ]

    cat("=== Added Value / Improvement Filtered Rules ===\n")
    imp_display <- head(imp_filtered[order(-imp_filtered$added_value), ], 10)
    knitr::kable(imp_display, digits = 2, format = "html",
                 col.names = c("Rule", "Confidence", "Lift", "Added Value", "Improvement")) %>%
      kableExtra::kable_styling(
        bootstrap_options = c("striped", "hover", "condensed"),
        full_width = FALSE,
        font_size = 11
      ) %>%
      kableExtra::scroll_box(width = "100%", height = "300px")
    cat("\nFiltered rules count:", nrow(imp_filtered), "/", nrow(imp_df), "\n")
  }
} else {
  cat("trans not found; skipping added value filtering.\n")
}
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.15    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 964 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[48452 item(s), 964462 transaction(s)] done [1.81s].
## sorting and recoding items ... [1773 item(s)] done [0.06s].
## creating transaction tree ... done [0.44s].
## checking subsets of size 1 2 3 4 done [0.16s].
## writing ... [1243 rule(s)] done [0.00s].
## creating S4 object  ... done [0.16s].
## === Added Value / Improvement Filtered Rules ===
## 
## Filtered rules count: 200 / 200

The scatter plot, titled “Added Value vs Improvement,” visualizes the distribution of filtered association rules based on their incremental predictive power. In this visualization, each point represents a rule, with the x-axis measuring the Added Value (how much the rule’s confidence exceeds the baseline popularity of the RHS) and the y-axis representing the Improvement (how much better the multi-item rule performs compared to its best single-item component).

There is a clear positive correlation between these two metrics, suggesting that rules with higher added value generally offer significant improvement over simpler antecedents. The points are colored by their Lift values, using a gradient from light blue to dark red. The cluster of dark red points in the upper right quadrant highlights the most valuable rules; these high-lift associations are not only statistically strong but also provide the most substantial information gain for predicting customer behavior. By identifying these “high-impact” rules—such as the specific yogurt or sparkling water combinations found in the results—we can distinguish truly meaningful product bundles from common, coincidental purchases.

# Visualization: Added Value vs Improvement
if (exists("imp_df") && nrow(imp_df) > 0) {
  df_imp <- imp_df
  df_imp <- df_imp[!is.na(df_imp$improvement), ]
  ggplot(df_imp, aes(x = added_value, y = improvement, color = lift)) +
    geom_point(alpha = 0.6) +
    scale_color_gradient(low = "lightblue", high = "darkred") +
    labs(title = "Added Value vs Improvement", x = "Added Value", y = "Improvement", color = "Lift") +
    theme_minimal()
} else {
  cat("imp_df not found; skip 5.4 plot.
")
}

6. Visualization & Summary

This section provides visual summaries of the final association rules. The network graph shows overall connectivity, while the chord diagram highlights strong pairwise links.

I use an interactive rule graph and a chord diagram to complement numeric rule summaries. The graph view highlights dense clusters of co-purchased items, while the chord diagram emphasizes strong, high-lift cross-category connections. Together, they help translate rule tables into intuitive relationships that can guide merchandising and recommendation design.

The visualization exposes two distinct structural patterns. First, the Network Graph displays a “Hub-and-Spoke” topology, where “Bag of Organic Bananas” acts as the central gravitational node, heavily linked to almost all other fresh produce items. Second, the Chord Diagram reveals a tightly knit “Organic Ecosystem”: thick arcs connect “Organic Hass Avocado”, “Organic Baby Spinach”, and “Organic Strawberries”, visually confirming that organic shoppers tend to buy across the entire category. Notably, the high-lift pair {Organic Cilantro} <=> {Limes} appears as a distinct, strong edge, visually separating itself from the general fruit cluster, marking it as a specific culinary behavior.

#  Plot interactive network (top 20 strong rules); click nodes to view association paths
# Use head() to avoid subscript out of bounds when rules_sorted has < 20 rules
plot(head(rules_sorted, 20), method = "graph", engine = "htmlwidget")

Extract and clean top rules into a pairwise table for the chord diagram.

# Ensure tidyverse (pipes, stringr) is loaded

# If tidyverse fails, try: library(dplyr); library(stringr)

# Chord diagram prep
top_rules_df <- as(head(rules_sorted, 15), "data.frame") %>%
  separate(rules, into = c("lhs", "rhs"), sep = " => ") %>%
  mutate(lhs = str_remove_all(lhs, "[{}]"),
         rhs = str_remove_all(rhs, "[{}]"))

Visualize strong pairwise links among top rules using a chord plot.

# Circlize chord diagram


try(circos.clear(), silent = TRUE)

items <- unique(c(top_rules_df$lhs, top_rules_df$rhs))
grid_col <- setNames(rep(brewer.pal(8, "Pastel2"), length.out = length(items)), items)

chordDiagram(top_rules_df[, c("lhs", "rhs", "lift")],
             grid.col = grid_col,
             annotationTrack = "grid",
             preAllocateTracks = list(track.height = 0.3))

circos.track(track.index = 1, panel.fun = function(x, y) {
  circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index,
              facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5), cex = 0.6)
}, bg.border = NA)
title("Top 15 Product Association Chord Diagram")

# Summary of rule quality and top rules by Lift / Kulczynski
quality_summary <- quality(rules_positive)
quality_summary_df <- data.frame(
  Metric = c("Support", "Confidence", "Lift", "Kulczynski", "Certainty"),
  Min = c(min(quality_summary$support), min(quality_summary$confidence), 
          min(quality_summary$lift), 
          min(quality_summary$kulczynski, na.rm = TRUE),
          min(quality_summary$certainty, na.rm = TRUE)),
  Mean = c(mean(quality_summary$support), mean(quality_summary$confidence), 
           mean(quality_summary$lift),
           mean(quality_summary$kulczynski, na.rm = TRUE),
           mean(quality_summary$certainty, na.rm = TRUE)),
  Max = c(max(quality_summary$support), max(quality_summary$confidence), 
          max(quality_summary$lift),
          max(quality_summary$kulczynski, na.rm = TRUE),
          max(quality_summary$certainty, na.rm = TRUE))
)
formattable::formattable(quality_summary_df)
Metric Min Mean Max
Support 0.0051044 0.009380302 0.01954561
Confidence 0.2527340 0.293323569 0.37895907
Lift 1.7420349 2.400580954 5.78901715
Kulczynski 0.1474954 0.184031523 0.22967472
Certainty 0.1283321 0.187373824 0.27155244
cat("\n=== Top 5 by Lift ===\n")
## 
## === Top 5 by Lift ===
inspect(head(sort(rules_positive, by = "lift"), 5))
##     lhs                                       rhs                          support confidence   coverage     lift count kulczynski certainty
## [1] {Organic Cilantro}                     => {Limes}                  0.005439302  0.2527340 0.02152184 5.789017  5246  0.1886622 0.2186210
## [2] {Organic Large Extra Fancy Fuji Apple} => {Bag of Organic Bananas} 0.007272448  0.3089323 0.02354059 2.619325  7014  0.1852964 0.2165268
## [3] {Organic Fuji Apple}                   => {Banana}                 0.010606950  0.3789591 0.02798970 2.570156 10230  0.2254485 0.2715524
## [4] {Organic Raspberries}                  => {Bag of Organic Bananas} 0.012699308  0.2966264 0.04281247 2.514988 12248  0.2021496 0.2025753
## [5] {Organic Hass Avocado}                 => {Bag of Organic Bananas} 0.019545612  0.2936293 0.06656561 2.489576 18851  0.2296747 0.1991775
cat("\n=== Top 5 by Kulczynski ===\n")
## 
## === Top 5 by Kulczynski ===
if ("kulczynski" %in% names(quality(rules_positive))) {
  inspect(head(sort(rules_positive, by = "kulczynski", decreasing = TRUE), 5))
}
##     lhs                       rhs                      support     confidence
## [1] {Organic Hass Avocado} => {Bag of Organic Bananas} 0.019545612 0.2936293 
## [2] {Organic Fuji Apple}   => {Banana}                 0.010606950 0.3789591 
## [3] {Honeycrisp Apple}     => {Banana}                 0.008993615 0.3606953 
## [4] {Organic Avocado}      => {Banana}                 0.016829072 0.3058241 
## [5] {Organic Raspberries}  => {Bag of Organic Bananas} 0.012699308 0.2966264 
##     coverage   lift     count kulczynski certainty
## [1] 0.06656561 2.489576 18851 0.2296747  0.1991775
## [2] 0.02798970 2.570156 10230 0.2254485  0.2715524
## [3] 0.02493411 2.446288  8674 0.2108456  0.2501300
## [4] 0.05502861 2.074144 16231 0.2099806  0.1857690
## [5] 0.04281247 2.514988 12248 0.2021496  0.2025753

7. Conclusions and Application Recommendations

Main Findings

  1. Organic items often appear together: Organic fruits and vegetables show clear co-purchase patterns.
  2. Extra metrics help: Kulczynski and Certainty Factor reduce false links from very popular products.
  3. Long-tail pattern: The top 20% of products take a large share of volume, so rules about top items are most impactful, while long-tail items help for niche recommendations.
  4. Recommendation ideas: Use cross-recommendations and bundle discounts based on rules like {Organic A} → {Organic B}. For example, promote bananas to lift strawberry baskets if the rule has high Lift.

Application

  • Cross-recommendation and bundle offers; align inventory and promotion with associated products.

Limitations

Sampling bias; parameter sensitivity; Apriori scalability.

8. Technical Appendix

  • R: 4.3.1; Packages: arules, arulesViz, data.table, tidyverse, circlize, RColorBrewer.
  • Parameters: supp=0.005, conf=0.25, maxlen=4 (set to avoid too many rules).
  • Optimization: data.table::fread for fast reads; select columns only; order-based sampling (20%-50%); rm() and gc() for memory management; set.seed(2026) for reproducibility.

9. References

  1. Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. ACM SIGMOD Record, 22(2), 207-216.
  2. Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. ACM SIGMOD Record, 29(2), 1-12.
  3. Hahsler, M., Grün, B., & Hornik, K. (2005). arules—A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15), 1-25.
  4. Tan, P. N., Steinbach, M., & Kumar, V. (2016). Introduction to Data Mining. Pearson.