Abstract

This report presents a market basket analysis using association rule mining. The analysis identifies product bundles and purchasing patterns that can inform business decisions such as product placement, cross-selling, and promotional bundle creation. Using the Apriori algorithm, I extracted associations between products and examined them through interactive and static visualizations.

Key Findings:

  • Identified high-confidence product associations with lift values indicating purchasing patterns significantly stronger than random chance
  • Clustered products into natural affinity groups based on co-occurrence patterns
  • Generated actionable bundle recommendations with specific marketing strategies

1. Introduction

1.1 Market Basket Analysis Overview

Market Basket Analysis (MBA) is a data mining technique used to discover associations between items that customers purchase together. The primary goal is to identify which products are frequently bought in combination, enabling businesses to:

  • Optimize product placement in physical or digital stores
  • Create effective product bundles and promotional offers
  • Enhance cross-selling and up-selling strategies
  • Improve inventory management based on product dependencies

1.2 Association Rules Methodology

Association rules are expressed in the form: {A, B} → {C}, which reads as “if a customer purchases products A and B, they are likely to purchase product C.”

1.2.1 Key Metrics

Three fundamental metrics evaluate association rules:

  1. Support: The proportion of transactions containing the itemset
    • Formula: \(Support(A \rightarrow B) = \frac{Transactions\ containing\ both\ A\ and\ B}{Total\ transactions}\)
    • Indicates how frequently the pattern occurs in the dataset
  2. Confidence: The probability that item B is purchased when item A is purchased
    • Formula: \(Confidence(A \rightarrow B) = \frac{Support(A \cup B)}{Support(A)}\)
    • Measures the reliability of the inference made by the rule
  3. Lift: The ratio of observed support to expected support if A and B were independent
    • Formula: \(Lift(A \rightarrow B) = \frac{Support(A \cup B)}{Support(A) \times Support(B)}\)
    • Values > 1 indicate positive correlation; values < 1 indicate negative correlation
    • Lift = 1 means A and B are independent (no relationship)
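
As a minimal sketch of the three metrics, the snippet below computes support, confidence, and lift for one rule on a hypothetical five-basket matrix. The toy data and item names are assumptions for illustration only and are not taken from the report's dataset.

```r
# Hypothetical toy baskets (not from market.csv): rows = transactions, columns = items
baskets <- matrix(c(1, 1, 0,
                    1, 1, 1,
                    1, 0, 0,
                    0, 1, 1,
                    1, 1, 0),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("Bread", "Butter", "Milk"))) == 1

# Rule under study: {Bread} -> {Butter}
supp_a  <- mean(baskets[, "Bread"])                        # Support(A)
supp_b  <- mean(baskets[, "Butter"])                       # Support(B)
supp_ab <- mean(baskets[, "Bread"] & baskets[, "Butter"])  # Support(A u B)

confidence <- supp_ab / supp_a
lift       <- supp_ab / (supp_a * supp_b)

round(c(support = supp_ab, confidence = confidence, lift = lift), 4)
```

In this toy case the confidence is 0.75 but the lift is below 1, because Butter already appears in 80% of baskets. This is why confidence alone can overstate a rule's value.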

1.2.2 The Apriori Algorithm

The Apriori algorithm (Agrawal & Srikant, 1994) is one of the most widely used methods for mining association rules. It operates on the principle that:

“If an itemset is frequent, then all of its subsets must also be frequent”

This property allows the algorithm to prune the search space efficiently. Equivalently, any superset of an infrequent itemset must itself be infrequent, so such candidates can be discarded without counting them.

Algorithm Steps:

  1. Generate frequent itemsets (those meeting minimum support threshold)
  2. Generate association rules from frequent itemsets (meeting minimum confidence)
  3. Filter rules based on lift and other quality measures
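
The level-wise search behind step 1 can be sketched in base R as follows. This is a simplified illustration on hypothetical toy baskets with an assumed minimum count of 3, not the optimised implementation inside the arules package; the helper name `count_itemset` is invented for the example.

```r
# Hypothetical toy data (not the report's dataset)
baskets <- list(c("Bread", "Butter", "Milk"),
                c("Bread", "Butter"),
                c("Bread", "Milk"),
                c("Butter", "Milk"),
                c("Bread", "Butter", "Milk"))
min_count <- 3  # assumed minimum support count

# Count the baskets that contain every item of an itemset
count_itemset <- function(itemset) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(baskets)))
frequent <- Filter(function(s) count_itemset(s) >= min_count, as.list(items))
all_frequent <- frequent

# Level k: join frequent (k-1)-itemsets, prune by the Apriori property, then count
while (length(frequent) > 1) {
  k <- length(frequent[[1]]) + 1
  pairs <- combn(seq_along(frequent), 2, simplify = FALSE)
  candidates <- unique(lapply(pairs, function(p) {
    sort(union(frequent[[p[1]]], frequent[[p[2]]]))
  }))
  candidates <- Filter(function(s) length(s) == k, candidates)
  # Prune: every (k-1)-subset of a candidate must itself be frequent
  keys <- vapply(frequent, paste, character(1), collapse = ",")
  candidates <- Filter(function(s) {
    subsets <- combn(s, k - 1, simplify = FALSE)
    all(vapply(subsets, paste, character(1), collapse = ",") %in% keys)
  }, candidates)
  frequent <- Filter(function(s) count_itemset(s) >= min_count, candidates)
  all_frequent <- c(all_frequent, frequent)
}

vapply(all_frequent, paste, character(1), collapse = ", ")
```

On this toy input all three single items and all three pairs are frequent, while the three-item set is counted and rejected because it occurs in only two baskets.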

2. Data Description

The dataset comprises 464 transactions and 22 product variables, structured in a binary market-basket format in which each row represents an individual customer purchase event and each column corresponds to a distinct product category. Item incidence is encoded dichotomously (1 = purchased, 0 = not purchased), thereby capturing the presence or absence of products within each shopping basket rather than purchase quantities.

cat("\n=== STEP 1: Loading Data ===\n")
## 
## === STEP 1: Loading Data ===
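# Packages used throughout the analysis
library(arules)        # transactions, apriori(), rule utilities
library(arulesViz)     # plot() methods for rules
library(RColorBrewer)  # brewer.pal() colour palettes

# Loading the binary transaction matrix (semicolon-separated file)
data_raw <- read.csv("market.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)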

The product space spans a mixed assortment of grocery and household goods—including staple foods (e.g., bread, bacon, cheese), fresh produce (e.g., banana, carrot, apple), pantry items (e.g., flour, sugar, salt), and non-food household products (e.g., toothpaste, shampoo, shaving foam). This composition reflects typical supermarket consumption patterns and provides a heterogeneous item universe suitable for co-occurrence analysis.

# Displaying basic info
cat("\nDataset dimensions:", dim(data_raw), "\n")
## 
## Dataset dimensions: 464 22
cat("Number of transactions:", nrow(data_raw), "\n")
## Number of transactions: 464
cat("Number of products:", ncol(data_raw), "\n")
## Number of products: 22
cat("\nFirst few rows:\n")
## 
## First few rows:
print(head(data_raw, 3))
##   Bread Honey Bacon Toothpaste Banana Apple Hazelnut Cheese Meat Carrot
## 1     1     0     1          0      1     1        1      0    0      1
## 2     1     1     1          0      1     1        1      0    0      0
## 3     0     1     1          1      1     1        1      1    1      0
##   Cucumber Onion Milk Butter ShavingFoam Salt Flour HeavyCream Egg Olive
## 1        0     0    0      0           0    0     0          1   1     0
## 2        1     0    1      1           0    0     1          0   0     1
## 3        1     1    1      0           1    1     1          1   1     0
##   Shampoo Sugar
## 1       0     1
## 2       1     0
## 3       0     1

3. Data Transformation

The transformation yields a dataset comprising 464 discrete transactions, each representing a unique shopping basket or customer purchase event. This preprocessed structure enables the Apriori algorithm to efficiently identify co-occurrence patterns by rapidly querying which items appear together across transactions, thereby forming the foundation for subsequent association rule discovery and business intelligence generation.

# Converting to transaction format
# Method 1: If data has transaction IDs in first column
if (!all(data_raw[,1] %in% c(0,1))) {
  trans_data <- as.matrix(data_raw[,-1])
  rownames(trans_data) <- data_raw[,1]
} else {
  # Method 2: All columns are products
  trans_data <- as.matrix(data_raw)
}

# Converting to logical matrix then to transactions
trans_logical <- trans_data == 1
trans <- as(trans_logical, "transactions")

cat("\nTransactions object created successfully!\n")
## 
## Transactions object created successfully!
cat("Total transactions:", length(trans), "\n")
## Total transactions: 464
cat("Total unique items:", length(itemLabels(trans)), "\n")
## Total unique items: 22

The resulting transactional dataset comprises 22 unique items, each corresponding to a distinct product category captured in the original data. Each item is treated as a binary attribute, indicating whether a given product was purchased in a particular transaction or not.
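
To make the change of representation concrete, the sketch below converts a small binary matrix into one item list per basket using base R only. The matrix `toy` is a hypothetical stand-in for `trans_data`; the report itself relies on the arules coercion shown above.

```r
# Hypothetical 3 x 4 binary matrix standing in for trans_data
toy <- matrix(c(1, 0, 1, 0,
                0, 1, 1, 1,
                1, 1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("Bread", "Honey", "Bacon", "Milk")))

# For each row, keep the names of the columns flagged 1
item_lists <- lapply(seq_len(nrow(toy)), function(i) colnames(toy)[toy[i, ] == 1])
item_lists
```

Each list element is the set of products in one basket, which is the information a transactions object stores in sparse form.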

4. Exploratory Data Analysis

The exploratory frequency analysis identified the ten most frequently purchased products by computing absolute item occurrence counts across all transactions.

# Calculating item frequencies
item_freq <- itemFrequency(trans, type = "absolute")
item_freq_pct <- itemFrequency(trans, type = "relative")

# Top 10 most popular items
cat("\nTop 10 Most Purchased Products:\n")
## 
## Top 10 Most Purchased Products:
top10 <- sort(item_freq, decreasing = TRUE)[1:10]
print(top10)
##      Banana      Cheese       Bacon    Hazelnut       Honey  HeavyCream 
##         208         206         200         195         193         193 
##      Carrot       Bread       Apple ShavingFoam 
##         192         189         188         188

The results indicate that Banana was the most popular item, appearing in 208 transactions, closely followed by Cheese (206) and Bacon (200). A second tier of high-frequency products includes Hazelnut (195), Honey (193), and Heavy Cream (193), each demonstrating strong but slightly lower purchase prevalence. The remaining items within the top ten comprise Carrot (192), Bread (189), Apple (188), and Shaving Foam (188). Collectively, these quantities reflect the highest-demand products within the transactional dataset, as determined by absolute purchase frequency, and therefore represent the core items driving co-occurrence patterns in subsequent association rule analysis.

4.1 Visualizing Exploratory Data Analysis

4.1.1 Distribution of the Top 10 Most Frequently Purchased Items

# Visualizing Top 10 items
par(mar = c(10, 4, 4, 2))
itemFrequencyPlot(trans, topN = 10, type = "absolute", 
                  col = brewer.pal(8, "Set2"),
                  main = "Top 10 Most Frequent Items",
                  ylab = "Frequency (Absolute Count)",
                  cex.names = 0.8, las = 2)

The frequency range is relatively narrow—spanning approximately from 188 to 208 occurrences—indicating a moderately even demand concentration among the leading products rather than the dominance of a single item. From a statistical perspective, the limited dispersion between first and tenth rank implies low variability among the most popular items, reflecting stable, recurrent purchasing patterns. Substantively, the graph indicates that everyday consumables—particularly fresh fruit, dairy, and breakfast-related products—constitute the core drivers of transaction frequency, thereby representing high-priority candidates for promotional bundling and association rule generation.

4.1.2 Relative Frequency Distribution of the Top 10 Purchased Items

# Additional plot - relative frequency
itemFrequencyPlot(trans, topN = 10, type = "relative", 
                  col = brewer.pal(8, "Pastel1"),
                  main = "Top 10 Items - Relative Frequency",
                  ylab = "Relative Frequency (Share of Transactions)",
                  cex.names = 0.8, las = 2)

Relative frequency indicates the share of transactions (not of individual customers) in which each item appears.

Banana, Cheese, and Bacon exhibit the highest relative frequencies (≈43–45%), identifying them as primary demand drivers. These products should receive priority in inventory planning, including higher stock levels, frequent replenishment cycles, and strategic shelf placement, as stockouts in these categories would affect nearly half of all transactions.

A second investment tier comprises Hazelnut, Honey, and Heavy Cream (≈41–42%). Their strong but slightly lower penetration indicates reliable, repeat demand, making them suitable candidates for promotional bundling—particularly with complementary staples such as bakery or breakfast items.

The remaining high-frequency goods—Carrot, Bread, Apple, and Shaving Foam (≈40–41%)—also warrant sustained inventory commitment. Notably, the inclusion of both fresh produce and personal care items suggests cross-category purchasing routines, supporting mixed-product bundling strategies.
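
The percentage ranges quoted above can be reproduced from the absolute counts. The sketch below hard-codes the ten counts and the 464 transactions reported earlier instead of reading them from the `trans` object, so it runs on its own.

```r
# Absolute counts copied from the top-10 output; 464 is the reported number of transactions
top10_counts <- c(Banana = 208, Cheese = 206, Bacon = 200, Hazelnut = 195,
                  Honey = 193, HeavyCream = 193, Carrot = 192, Bread = 189,
                  Apple = 188, ShavingFoam = 188)
n_transactions <- 464

# Relative frequency in percent, rounded to one decimal place
rel_freq_pct <- round(100 * top10_counts / n_transactions, 1)
rel_freq_pct
```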

4.2 Transaction Statistics

# Transaction size distribution
trans_sizes <- size(trans)
cat("\nTransaction Statistics:\n")
## 
## Transaction Statistics:
cat("Average items per transaction:", round(mean(trans_sizes), 2), "\n")
## Average items per transaction: 8.79
cat("Median items per transaction:", median(trans_sizes), "\n")
## Median items per transaction: 9
cat("Max items in single transaction:", max(trans_sizes), "\n")
## Max items in single transaction: 17

4.2.1 Visualizing basket sizes

# Plot transaction size distribution
hist(trans_sizes, 
     breaks = 20, 
     col = "steelblue",
     main = "Distribution of Basket Sizes",
     xlab = "Number of Items per Transaction",
     ylab = "Frequency")
abline(v = mean(trans_sizes), col = "red", lwd = 2, lty = 2)
legend("topright", legend = paste("Mean =", round(mean(trans_sizes), 2)), 
       col = "red", lty = 2, lwd = 2)

The histogram depicts the distribution of basket sizes, measured as the number of items purchased per transaction, thereby illustrating customer purchasing intensity across the 464 observed shopping events. The distribution appears moderately bell-shaped, with the highest concentration of transactions clustered between approximately 8 and 12 items, indicating that mid-sized baskets dominate purchasing behaviour.

The mean basket size is 8.79 items, as indicated by the red dashed reference line. This value represents the average number of products purchased per shopping trip and suggests that a typical customer buys roughly nine items per transaction. The proximity of the mean to the central mass of the histogram indicates a relatively balanced distribution without extreme skewness.
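
Basket size is simply the row sum of the binary matrix, and the mean basket size equals the sum of the items' relative frequencies. The sketch below checks that identity on a hypothetical toy matrix and then applies it to the reported figures (8.79 items across 22 products).

```r
# Hypothetical toy matrix (rows = baskets, columns = items)
toy <- matrix(c(1, 0, 1, 0,
                0, 1, 1, 1,
                1, 1, 0, 0),
              nrow = 3, byrow = TRUE)

basket_sizes <- rowSums(toy == 1)
c(mean = mean(basket_sizes), median = median(basket_sizes), max = max(basket_sizes))

# Identity: mean basket size equals the sum of per-item relative frequencies
all.equal(mean(basket_sizes), sum(colMeans(toy)))

# Applied to the reported figures: average share of baskets containing a given item
avg_item_share <- 8.79 / 22
round(avg_item_share, 3)
```

An average item share of about 0.40 is consistent with the top-10 relative frequencies of roughly 40% to 45%.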

5. Association Rule Mining

5.1 Executing Apriori Algorithm

Parameter Selection Rationale:

  • Support = 0.005: Captures patterns occurring in at least 0.5% of transactions (with 464 transactions, at least 3 baskets), allowing discovery of niche associations while excluding patterns seen only once or twice
  • Confidence = 0.5: Ensures that the consequent occurs in at least 50% of transactions containing the antecedent, providing reliable predictions
  • Minlen = 2: Excludes single-item patterns, focusing on actual associations
  • Maxlen = 10: Allows complex multi-item bundles while maintaining computational efficiency
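
A relative support threshold translates into a whole number of baskets. The short sketch below makes that conversion for the reported dataset size; the variable names are chosen for illustration.

```r
n_transactions <- 464    # reported dataset size
min_support    <- 0.005  # support threshold used in the report

# Smallest whole number of baskets whose share reaches the threshold
min_count <- ceiling(min_support * n_transactions)
min_count

# Smallest support value a retained itemset can actually have
min_count / n_transactions
```

The result, 3 baskets or a support of about 0.0065, matches the minimum count and minimum support shown in the rule summary in Section 5.2.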

Lift is the most critical metric for identifying meaningful associations, as it accounts for the popularity of individual items.

Why Lift > 1.2?

  • Lift = 1.2 means the items co-occur 20% more frequently than expected by chance
  • This threshold filters out weak associations while retaining actionable insights
  • Cut-offs of roughly 1.2 to 1.5 are a common rule of thumb in retail analytics rather than a formal standard

cat("\n=== STEP 3: Running Apriori Algorithm ===\n")
## 
## === STEP 3: Running Apriori Algorithm ===
# Generating rules with specified parameters
rules_all <- apriori(trans, 
                     parameter = list(supp = 0.005,  # Low support for niche bundles
                                      conf = 0.5,      # 50% confidence threshold
                                      minlen = 2,      # At least 2 items
                                      maxlen = 10),
                     control = list(verbose = TRUE))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 2 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[22 item(s), 464 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
##  done [0.13s].
## writing ... [1301947 rule(s)] done [0.34s].
## creating S4 object  ... done [1.31s].
cat("\nTotal rules generated:", length(rules_all), "\n")
## 
## Total rules generated: 1301947
# Filtering for high-lift rules (Lift > 1.2)
rules_high_lift <- subset(rules_all, lift > 1.2)
cat("Rules with Lift > 1.2:", length(rules_high_lift), "\n")
## Rules with Lift > 1.2: 1270920

Under these parameters, the algorithm generated a total of 1,301,947 association rules, reflecting the extensive combinatorial structure of product co-occurrence within the dataset. Each rule represents a probabilistic implication of the form {Item A, Item B} → {Item C}, indicating that customers purchasing the antecedent set are statistically likely to also purchase the consequent item.
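
The size of the search space helps explain the rule count. As a rough sketch, the snippet below counts how many distinct itemsets of sizes 2 to 10 can be formed from 22 items. This is an upper bound on candidate itemsets that ignores the support threshold, not a count of rules.

```r
n_items <- 22    # products in the dataset
sizes   <- 2:10  # minlen to maxlen

# Number of distinct itemsets of each size
itemsets_per_size <- choose(n_items, sizes)
names(itemsets_per_size) <- sizes
itemsets_per_size

total_itemsets <- sum(itemsets_per_size)
total_itemsets
```

With a minimum support count of only three baskets and an average basket of nearly nine items, a large share of these combinations can pass the threshold, which is consistent with a rule set in the millions.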

To refine analytical relevance, the rule set was subsequently filtered using a lift threshold greater than 1.2, reducing the list to 1,270,920 high-lift rules. Lift measures the strength of association relative to random co-occurrence; values above 1 indicate positive dependence, and the 1.2 cut-off requires co-occurrence at least 20% above the independence baseline. The filter removed only 31,027 rules (about 2.4% of the total), so most of the narrowing comes from the bundle-format and significance filters applied in Sections 6 and 7. Because the minimum support corresponds to just three baskets, many of the retained high-lift rules rest on very few transactions and should be read with caution.

Products that repeatedly appear together in high-support, high-confidence, and high-lift rules can be interpreted as complementary goods, suitable for bundling, co-promotion, or adjacency placement in retail layouts.

5.2 Summary Statistics

# Display summary statistics
cat("\n--- Rule Quality Metrics ---\n")
## 
## --- Rule Quality Metrics ---
summary(rules_high_lift)
## set of 1270920 rules
## 
## rule length distribution (lhs + rhs):sizes
##      2      3      4      5      6      7      8      9     10 
##     38   1826  14700  71082 244432 471191 340910 107946  18795 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   6.000   7.000   7.138   8.000  10.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.006466   Min.   :0.5000   Min.   :0.006466   Min.   :1.200  
##  1st Qu.:0.006466   1st Qu.:0.5833   1st Qu.:0.008621   1st Qu.:1.442  
##  Median :0.008621   Median :0.6667   Median :0.012931   Median :1.673  
##  Mean   :0.010774   Mean   :0.7085   Mean   :0.016474   Mean   :1.764  
##  3rd Qu.:0.010776   3rd Qu.:0.8000   3rd Qu.:0.017241   3rd Qu.:1.985  
##  Max.   :0.241379   Max.   :1.0000   Max.   :0.448276   Max.   :2.729  
##      count        
##  Min.   :  3.000  
##  1st Qu.:  3.000  
##  Median :  4.000  
##  Mean   :  4.999  
##  3rd Qu.:  5.000  
##  Max.   :112.000  
## 
## mining info:
##   data ntransactions support confidence
##  trans           464   0.005        0.5
##                                                                                                                        call
##  apriori(data = trans, parameter = list(supp = 0.005, conf = 0.5, minlen = 2, maxlen = 10), control = list(verbose = TRUE))

A total of 1,270,920 association rules were retained after lift filtering, with rule lengths most commonly ranging between 6 and 8 items (median = 7; mean ≈ 7.14), indicating moderately complex co-purchase structures. Support values are generally low (mean ≈ 0.0108), reflecting that most rules describe niche rather than mass purchasing patterns, while confidence levels are relatively strong (mean ≈ 0.71), suggesting reliable predictive relationships. Lift statistics (mean ≈ 1.76; max = 2.73) confirm the presence of meaningful positive product affinities, where co-occurrence exceeds random expectation.
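
The length statistics quoted above can be checked directly from the rule length distribution. The sketch below hard-codes the counts printed by `summary()` so that it runs without the rule object.

```r
# Rule counts by length, copied from the summary output
length_counts <- c(`2` = 38, `3` = 1826, `4` = 14700, `5` = 71082, `6` = 244432,
                   `7` = 471191, `8` = 340910, `9` = 107946, `10` = 18795)
rule_lengths <- as.numeric(names(length_counts))

total_rules  <- sum(length_counts)
mean_length  <- sum(rule_lengths * length_counts) / total_rules
share_6_to_8 <- sum(length_counts[c("6", "7", "8")]) / total_rules

c(total = total_rules, mean_length = round(mean_length, 3),
  share_6_to_8 = round(share_6_to_8, 3))
```

Rules of length 6 to 8 make up roughly five sixths of the retained set, while the 2- and 3-item rules used for bundling in the next section are a very small fraction of it.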

6. Bundle Identification - Filtering for Actionable Bundles

For business implementation, I focused on simple bundle structures that are easy to communicate and operationalize.

Bundle Format Rationale:

  • {1-2 items} → {1 item} format is optimal for:
    • Clear communication: “Customers who buy A (and B) are recommended C”
    • Easy implementation in recommendation systems
    • Straightforward promotional messaging
    • Simple inventory and pricing management

cat("\n=== STEP 4: Identifying Optimal Product Bundles ===\n")
## 
## === STEP 4: Identifying Optimal Product Bundles ===
# Filtering for simple bundles: LHS (1-2 items) => RHS (1 item)
# This format is best for "Buy A+B, Get C" bundles

# Getting LHS and RHS sizes
lhs_sizes <- size(lhs(rules_high_lift))
rhs_sizes <- size(rhs(rules_high_lift))

# Filter criteria
bundle_rules <- rules_high_lift[lhs_sizes >= 1 & lhs_sizes <= 2 & rhs_sizes == 1]

cat("\nFiltered bundle rules (LHS: 1-2 items, RHS: 1 item):", length(bundle_rules), "\n")
## 
## Filtered bundle rules (LHS: 1-2 items, RHS: 1 item): 1864
# Sort by lift to find strongest associations
bundle_rules_sorted <- sort(bundle_rules, by = "lift", decreasing = TRUE)

cat("\n--- TOP 20 PRODUCT BUNDLE OPPORTUNITIES ---\n")
## 
## --- TOP 20 PRODUCT BUNDLE OPPORTUNITIES ---
inspect(head(bundle_rules_sorted, 20))
##      lhs                     rhs           support    confidence coverage 
## [1]  {Bacon, Cheese}      => {Butter}      0.13793103 0.6153846  0.2241379
## [2]  {Bacon, Sugar}       => {Meat}        0.11853448 0.6321839  0.1875000
## [3]  {Cheese, Onion}      => {Butter}      0.11637931 0.6067416  0.1918103
## [4]  {Meat, Salt}         => {Sugar}       0.10344828 0.5925926  0.1745690
## [5]  {Hazelnut, Shampoo}  => {Butter}      0.10560345 0.6049383  0.1745690
## [6]  {Bacon, Toothpaste}  => {Butter}      0.10560345 0.6049383  0.1745690
## [7]  {Carrot, Butter}     => {Toothpaste}  0.09913793 0.6133333  0.1616379
## [8]  {Bread, Onion}       => {Butter}      0.09913793 0.5974026  0.1659483
## [9]  {Milk, ShavingFoam}  => {HeavyCream}  0.09698276 0.6617647  0.1465517
## [10] {Honey, Bacon}       => {Meat}        0.12500000 0.6170213  0.2025862
## [11] {Meat, Salt}         => {Shampoo}     0.10129310 0.5802469  0.1745690
## [12] {Hazelnut, Cheese}   => {Butter}      0.12284483 0.5937500  0.2068966
## [13] {HeavyCream, Sugar}  => {Salt}        0.11422414 0.6309524  0.1810345
## [14] {Onion, Sugar}       => {Toothpaste}  0.09913793 0.6052632  0.1637931
## [15] {Bacon, ShavingFoam} => {Butter}      0.12715517 0.5900000  0.2155172
## [16] {Milk, Salt}         => {HeavyCream}  0.09698276 0.6521739  0.1487069
## [17] {Bacon, HeavyCream}  => {Milk}        0.11637931 0.5806452  0.2004310
## [18] {Banana, Butter}     => {Bacon}       0.12068966 0.6746988  0.1788793
## [19] {Bacon, Onion}       => {Banana}      0.13146552 0.7011494  0.1875000
## [20] {Bacon, Onion}       => {ShavingFoam} 0.11853448 0.6321839  0.1875000
##      lift     count
## [1]  1.641026 64   
## [2]  1.629630 55   
## [3]  1.617978 54   
## [4]  1.617429 48   
## [5]  1.613169 49   
## [6]  1.613169 49   
## [7]  1.598801 46   
## [8]  1.593074 46   
## [9]  1.590978 45   
## [10] 1.590544 58   
## [11] 1.583733 47   
## [12] 1.583333 57   
## [13] 1.582497 53   
## [14] 1.577765 46   
## [15] 1.573333 59   
## [16] 1.567921 45   
## [17] 1.566392 54   
## [18] 1.565301 56   
## [19] 1.564103 61   
## [20] 1.560284 55
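
As a worked check of how the columns relate, the sketch below rebuilds the metrics of rule [1], {Bacon, Cheese} => {Butter}, from its reported count and coverage. The support of Butter is not printed in the table; it is inferred here from confidence divided by lift, which is an assumption of this example.

```r
n          <- 464        # transactions
count_rule <- 64         # baskets containing Bacon, Cheese and Butter (count column)
coverage   <- 0.2241379  # reported support of the antecedent {Bacon, Cheese}
lift       <- 1.641026   # reported lift

lhs_count  <- round(coverage * n)    # baskets containing Bacon and Cheese
support    <- count_rule / n         # Support(A u B)
confidence <- count_rule / lhs_count # Support(A u B) / Support(A)

# Implied support of the consequent, since lift = confidence / Support(B)
rhs_support <- confidence / lift

round(c(lhs_count = lhs_count, support = support,
        confidence = confidence, rhs_support = rhs_support), 4)
```

The implied Butter support of about 0.375 corresponds to roughly 174 of the 464 baskets.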

7. Advanced Filtering & Cleaning

Raw association rules often contain redundant or statistically insignificant patterns. I applied three refinement techniques:

cat("\n=== STEP 5: Cleaning Rules ===\n")
## 
## === STEP 5: Cleaning Rules ===
# Removing redundant rules
bundle_rules_clean <- bundle_rules_sorted[!is.redundant(bundle_rules_sorted)]
cat("After removing redundant rules:", length(bundle_rules_clean), "\n")
## After removing redundant rules: 1782
# Keeping only significant rules (Fisher's exact test)
bundle_rules_clean <- bundle_rules_clean[is.significant(bundle_rules_clean, trans)]
cat("After filtering for statistical significance:", length(bundle_rules_clean), "\n")
## After filtering for statistical significance: 1187
# Keeping only maximal rules
bundle_rules_clean <- bundle_rules_clean[is.maximal(bundle_rules_clean)]
cat("After keeping only maximal rules:", length(bundle_rules_clean), "\n")
## After keeping only maximal rules: 1149
# Final top bundles
cat("\n--- FINAL TOP 15 BUNDLE RECOMMENDATIONS ---\n")
## 
## --- FINAL TOP 15 BUNDLE RECOMMENDATIONS ---
inspect(head(bundle_rules_clean, 15))
##      lhs                     rhs          support    confidence coverage 
## [1]  {Bacon, Cheese}      => {Butter}     0.13793103 0.6153846  0.2241379
## [2]  {Bacon, Sugar}       => {Meat}       0.11853448 0.6321839  0.1875000
## [3]  {Cheese, Onion}      => {Butter}     0.11637931 0.6067416  0.1918103
## [4]  {Meat, Salt}         => {Sugar}      0.10344828 0.5925926  0.1745690
## [5]  {Hazelnut, Shampoo}  => {Butter}     0.10560345 0.6049383  0.1745690
## [6]  {Bacon, Toothpaste}  => {Butter}     0.10560345 0.6049383  0.1745690
## [7]  {Carrot, Butter}     => {Toothpaste} 0.09913793 0.6133333  0.1616379
## [8]  {Bread, Onion}       => {Butter}     0.09913793 0.5974026  0.1659483
## [9]  {Milk, ShavingFoam}  => {HeavyCream} 0.09698276 0.6617647  0.1465517
## [10] {Honey, Bacon}       => {Meat}       0.12500000 0.6170213  0.2025862
## [11] {Meat, Salt}         => {Shampoo}    0.10129310 0.5802469  0.1745690
## [12] {Hazelnut, Cheese}   => {Butter}     0.12284483 0.5937500  0.2068966
## [13] {HeavyCream, Sugar}  => {Salt}       0.11422414 0.6309524  0.1810345
## [14] {Onion, Sugar}       => {Toothpaste} 0.09913793 0.6052632  0.1637931
## [15] {Bacon, ShavingFoam} => {Butter}     0.12715517 0.5900000  0.2155172
##      lift     count
## [1]  1.641026 64   
## [2]  1.629630 55   
## [3]  1.617978 54   
## [4]  1.617429 48   
## [5]  1.613169 49   
## [6]  1.613169 49   
## [7]  1.598801 46   
## [8]  1.593074 46   
## [9]  1.590978 45   
## [10] 1.590544 58   
## [11] 1.583733 47   
## [12] 1.583333 57   
## [13] 1.582497 53   
## [14] 1.577765 46   
## [15] 1.573333 59

Refinement Techniques Explained:

  1. Redundancy Removal: Eliminates rules that add no information beyond a more general rule. For example, if {A,B,C} → {D} has a confidence no higher than {A,B} → {D}, the longer rule is redundant.

  2. Statistical Significance: Applies Fisher’s exact test to check that the association is unlikely to arise by random chance. No significance level is passed in the call above, so the default of is.significant() applies; p < 0.05 is the conventional cut-off in general practice.

  3. Maximal Rules: Retains only rules whose item set (antecedent plus consequent) is not contained in the item set of a longer rule in the same collection. This prevents listing multiple rules that essentially describe the same pattern.

Business Value: These refinements ensure that marketing teams focus on unique, statistically validated insights rather than redundant information.
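
To illustrate the significance step, the sketch below runs a one-sided Fisher's exact test in base R on the 2 x 2 table for rule [1], {Bacon, Cheese} => {Butter}. The counts 64 and 104 follow from the reported count and coverage; the 174 Butter baskets are not reported directly and are inferred from confidence / lift (0.375 of 464), so treat that figure as an assumption. This mirrors the idea behind the test and is not a reproduction of the internals of `is.significant()`.

```r
n         <- 464  # transactions
both      <- 64   # baskets with antecedent and consequent
lhs_total <- 104  # baskets with {Bacon, Cheese}
rhs_total <- 174  # baskets with Butter (inferred, see lead-in)

# 2 x 2 contingency table: antecedent present/absent vs consequent present/absent
tab <- matrix(c(both,             lhs_total - both,
                rhs_total - both, n - lhs_total - rhs_total + both),
              nrow = 2, byrow = TRUE,
              dimnames = list(LHS = c("yes", "no"), RHS = c("yes", "no")))
tab

# One-sided test: is co-occurrence higher than expected under independence?
p_value <- fisher.test(tab, alternative = "greater")$p.value
p_value
```

Under independence about 39 joint baskets would be expected (104 × 174 / 464), against 64 observed, so the test rejects independence comfortably for this rule.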

8. Interactive visualization of the data

8.1 Network graph

The network graph represents products and rules as nodes connected by directed edges, with the size and color of the rule nodes indicating rule strength.

cat("\n=== STEP 6: Creating Visualizations ===\n")
## 
## === STEP 6: Creating Visualizations ===
# Select top rules for visualization (to avoid clutter)
top_rules_viz <- head(bundle_rules_clean, 50)

# 1. Network Graph - Interactive
cat("\nGenerating interactive network graph...\n")
## 
## Generating interactive network graph...
# `type = "items"` is not a control parameter of the htmlwidget engine
# (it was ignored and only triggered a listing of the defaults), so it is omitted
plot(top_rules_viz, 
     method = "graph", 
     engine = "htmlwidget")

Interpretation of the Interactive Association Network Graph

  • Network structure
    • The graph visualises association rules as a directed network
    • Item nodes represent individual products; each rule is drawn as its own node, with arrows running from the antecedent items into the rule node and from the rule node to the consequent item
    • The structure highlights how items are interconnected within customer baskets
  • Node interpretation
    • Under the arulesViz defaults, larger rule nodes indicate higher rule support and more intensely coloured rule nodes indicate higher lift
    • Item nodes with many connections are anchor products that take part in many rules
    • These items play a central role in basket formation
  • Edge (arrow) interpretation
    • Arrow direction shows the purchase implication (antecedent → rule → consequent)
    • Arrows carry direction only; rule strength is read from the rule nodes, not from arrow thickness or colour
  • Cluster interpretation
    • Densely connected node groups indicate strong mutual co-purchase behaviour
    • Such clusters represent natural product bundles or consumption ecosystems
    • Sparse or isolated nodes indicate weaker integration into basket networks
  • Business applications
    • Product placement: Position highly connected items adjacently in-store or online
    • Cross-category merchandising: Leverage unexpected inter-category links
    • Bundle creation: Design multi-item offers from dense clusters
    • Promotional planning: Use hub products as campaign anchors
  • Interactive analytical value
    • Hovering reveals product-level statistics and rule metrics
    • Dragging enables structural exploration of cluster proximity
    • Zooming isolates micro-patterns within dense regions
    • Edge selection allows inspection of specific rule relationships
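
Hub products can also be identified without the graph by counting how often each item appears on either side of a rule. The sketch below does this in base R for the 15 final bundle rules listed in Section 7; the item names are copied from that table.

```r
# Antecedents and consequents of the final top 15 bundle rules
lhs <- list(c("Bacon", "Cheese"), c("Bacon", "Sugar"), c("Cheese", "Onion"),
            c("Meat", "Salt"), c("Hazelnut", "Shampoo"), c("Bacon", "Toothpaste"),
            c("Carrot", "Butter"), c("Bread", "Onion"), c("Milk", "ShavingFoam"),
            c("Honey", "Bacon"), c("Meat", "Salt"), c("Hazelnut", "Cheese"),
            c("HeavyCream", "Sugar"), c("Onion", "Sugar"), c("Bacon", "ShavingFoam"))
rhs <- c("Butter", "Meat", "Butter", "Sugar", "Butter", "Butter", "Toothpaste",
         "Butter", "HeavyCream", "Meat", "Shampoo", "Butter", "Salt",
         "Toothpaste", "Butter")

# How often each item is recommended (consequent) or triggers a rule (antecedent)
consequent_counts <- sort(table(rhs), decreasing = TRUE)
antecedent_counts <- sort(table(unlist(lhs)), decreasing = TRUE)

consequent_counts
antecedent_counts
```

Butter is the consequent of 7 of the 15 rules even though it is not among the ten most frequently purchased items, and Bacon is the most common antecedent. Within this subset, these two items act as the main hubs.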

8.2 Scatter Plot

This visualization reveals the relationship between rule frequency (support) and reliability (confidence), with lift as a third dimension.

plot(bundle_rules_clean, 
     measure = c("support", "confidence"), 
     shading = "lift",
     main = "Product Bundle Rules: Support vs Confidence (colored by Lift)")

  • Overall pattern
    • The distribution shows a positive relationship between support and confidence
    • Most rules cluster in mid-support (≈0.09–0.12) and mid-confidence (≈0.52–0.62) ranges
    • Lift shading indicates predominantly moderate-to-strong product affinities
  • Top-Right Quadrant (High Support, High Confidence) — Core revenue drivers
    • Frequent and highly reliable purchase combinations
    • Represent mainstream bundle opportunities
    • Should be prioritised for:
      • Bundle promotions
      • Shelf adjacency
      • Inventory investment
    • These products generate stable turnover and cross-sell profit
  • Top-Left Quadrant (Low Support, High Confidence) — Niche profit pockets
    • Less frequent but highly predictive associations
    • Suitable for:
      • Personalised recommendations
      • Premium or specialty bundles
      • Targeted marketing campaigns
    • High margin potential despite lower volume
  • Bottom-Right Quadrant (High Support, Low Confidence) — Volume without synergy
    • Frequently purchased but weakly associated items
    • Customers buy them often, but not consistently together
    • Better suited for:
      • Standalone promotions
      • Loss-leader pricing strategies
    • Bundling may dilute profitability
  • Bottom-Left Quadrant (Low Support, Low Confidence) — Income drainers
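    • Rules that are rarer and less reliable than the other retained rules
    • Comparatively low lift, although every plotted rule already passed the confidence ≥ 0.5 and lift > 1.2 filters
    • Limited bundling or promotional value
    • Candidates for:
      • Lower priority in bundle design
      • Review of item-level sales before any inventory reduction or delisting, since rule metrics describe co-purchase patterns and not demand for individual items
      • Clearance or markdown strategies only where item-level sales support them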
  • Lift-based prioritisation
    • Warmer/red points indicate stronger-than-random associations
    • High-lift rules within top quadrants signal premium bundling leverage
    • Cooler points reflect weak economic synergy
  • Strategic summary
    • Invest in high-support, high-confidence clusters
    • Monetise niche high-confidence rules via targeted offers
    • Avoid bundling low-confidence combinations
    • Rationalise inventory tied to low-support, low-lift rules
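
The quadrant reading can be made explicit by splitting rules at chosen cut-offs. The sketch below classifies the first six final bundle rules using their medians as cut-offs. The median split is an assumption made for illustration; the report does not state which thresholds separate "high" from "low".

```r
# Support and confidence of the first six final bundle rules (from Section 7)
rules_df <- data.frame(
  rule       = 1:6,
  support    = c(0.13793103, 0.11853448, 0.11637931, 0.10344828, 0.10560345, 0.10560345),
  confidence = c(0.6153846, 0.6321839, 0.6067416, 0.5925926, 0.6049383, 0.6049383)
)

# Assumed cut-offs: medians of the rules being compared
supp_cut <- median(rules_df$support)
conf_cut <- median(rules_df$confidence)

rules_df$quadrant <- paste(
  ifelse(rules_df$support    > supp_cut, "high support", "low support"),
  ifelse(rules_df$confidence > conf_cut, "high confidence", "low confidence"),
  sep = " / "
)

table(rules_df$quadrant)
```

Because the cut-offs are relative, "low" here means low compared with other retained rules, all of which already satisfy the minimum confidence and lift filters.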

8.3 Grouped Matrix Plot

The grouped matrix organizes rules by clustering similar antecedents and consequents, revealing structural patterns in customer behavior.

# Note: `main` is not a recognised control parameter for the grouped plot with
# the ggplot2 engine (it was ignored and triggered a listing of the defaults),
# so no title argument is passed here
plot(head(bundle_rules_clean, 30), 
     method = "grouped",
     control = list(k = 5))

  • Cluster 1 — Staple purchases (everyday items)
    • Dense diagonal concentration of rules
    • Darker cells indicate strong lift among routine necessities
    • Larger circles reflect high support and frequent co-purchase
    • Represents core household replenishment baskets
  • Cluster 2 — Specialty / premium combinations
    • Selective darker cells with smaller support circles
    • Strong associations but lower transaction frequency
    • Indicates niche or higher-value product pairings
    • Reflects targeted or preference-driven purchasing
  • Cluster 3 — Complementary product categories
    • Moderate-to-strong lift within functionally related goods
    • Medium circle sizes signal consistent but not dominant support
    • Products purchased together for shared usage or consumption context
    • Suggests natural bundling opportunities
  • Cluster 4 — Seasonal / promotional patterns
    • Scattered cell intensity across the matrix
    • Variable lift and smaller support levels
    • Associations appear episodic rather than routine
    • Likely driven by discounts, holidays, or campaigns
  • Cluster 5 — Cross-category purchase behaviours
    • Pronounced off-diagonal hotspots
    • Mix of grocery and non-grocery consequents
    • Indicates shoppers combining unrelated categories in one trip
    • Reflects convenience-driven or one-stop shopping missions

From a purchasing standpoint, several staple-driven clusters are evident. Antecedent groups containing everyday consumables—such as dairy, breakfast, and pantry items—show strong lift relationships with complementary goods like Milk, Butter, Heavy Cream, and Bacon, indicating routine household replenishment missions. These diagonal concentrations suggest coherent within-cluster shopping patterns where customers repeatedly purchase functionally related staples in the same trip. Conversely, clusters featuring mixed grocery–household antecedents display off-diagonal hotspots linking items such as Toothpaste, Shampoo, and Meat, revealing cross-category baskets that combine food shopping with personal care restocking.

8.4 Network Graph - Static Version

plot(head(bundle_rules_clean, 20), 
     method = "graph",
     control = list(type = "items"),
     main = "Top 20 Bundle Rules - Network View")
## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

The network highlights Butter, Bacon, and Cheese as the strongest bundle anchors: they carry the darkest red nodes and edges (highest lift, ≈ 1.62–1.64) together with relatively large support, indicating strong and frequent co-purchase. Lift measures association strength rather than margin, so profitability still has to be confirmed against pricing data. Mid-tier red associations connect Toothpaste–Shampoo and Carrot–Hazelnut, suggesting cross-category and specialty bundles with moderate but actionable lift. Paler, weakly connected items such as Salt, Sugar, and Honey show lower lift and sparse linkages, marking them as low-synergy add-ons with limited bundling potential.

8.5 Parallel Coordinates Plot

plot(head(bundle_rules_clean, 20), 
     method = "paracoord",
     control = list(reorder = TRUE),
     main = "Bundle Rule Patterns - Parallel Coordinates")

The most commercially advantageous transitions converge on Butter, Milk, and Meat, marking them as high-impact add-on products when paired with baskets containing staples such as Cheese, Bread, or Bacon. Conversely, lighter, fragmented paths linked to items like Sugar, Salt, and Shampoo reflect weaker, less monetizable associations, signalling limited bundling leverage and lower incremental revenue potential.

8.6 Interactive Rule Explorer

The Interactive Rule Explorer enables real-time filtering, sorting, and querying of rules by key quality metrics—such as support, confidence, and lift—allowing marketing teams to rapidly isolate the most commercially relevant product combinations.

cat("\n=== STEP 7: Launching Interactive Rule Explorer ===\n")
## 
## === STEP 7: Launching Interactive Rule Explorer ===
cat("This will open an interactive dashboard in your browser...\n")
## This will open an interactive dashboard in your browser...
# Launch the interactive rule explorer. It is a Shiny app, so it cannot run in a
# static R Markdown document; start it only from an interactive session.
if (interactive()) ruleExplorer(bundle_rules_clean)
# Alternative: Using inspectDT for interactive table
library(DT)
inspectDT(head(bundle_rules_clean, 50))

9. Business Recommendations

## 
## === STEP 8: BUSINESS RECOMMENDATIONS ===
##     lhs                rhs      support  confidence coverage  lift     count
## [1] {Bacon, Cheese} => {Butter} 0.137931 0.6153846  0.2241379 1.641026 64   
## ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
## BUNDLE RECOMMENDATION # 1 
## ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
## 
## 📦 Bundle Name Suggestion:
##    'Bacon,Cheese + Butter Value Pack'
## 
## 🛒 Bundle Components:
##    Base Items:  Bacon,Cheese 
##    Add-on Item:  Butter 
## 
## 📊 Performance Metrics:
##    • Lift: 1.64 (64.1% stronger than random)
##    • Confidence: 61.5% (customers who buy base items also buy add-on)
##    • Support: 13.79% (occurs in 64 transactions)
## 
## 💡 Marketing Strategy:
##    RECOMMENDED - Strong complementary relationship
##    → Feature as 'Frequently Bought Together'
##    → Offer 5-10% bundle discount
##    → Add to product recommendation widgets
## 
## 🎯 Expected Impact:
##    If this bundle converts even 20% of applicable carts,
##    you could influence ~ 13  transactions.
## 
##     lhs               rhs    support   confidence coverage lift    count
## [1] {Bacon, Sugar} => {Meat} 0.1185345 0.6321839  0.1875   1.62963 55   
## ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
## BUNDLE RECOMMENDATION # 2 
## ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
## 
## 📦 Bundle Name Suggestion:
##    'Bacon,Sugar + Meat Value Pack'
## 
## 🛒 Bundle Components:
##    Base Items:  Bacon,Sugar 
##    Add-on Item:  Meat 
## 
## 📊 Performance Metrics:
##    • Lift: 1.63 (63% stronger than random)
##    • Confidence: 63.2% (customers who buy base items also buy add-on)
##    • Support: 11.85% (occurs in 55 transactions)
## 
## 💡 Marketing Strategy:
##    RECOMMENDED - Strong complementary relationship
##    → Feature as 'Frequently Bought Together'
##    → Offer 5-10% bundle discount
##    → Add to product recommendation widgets
## 
## 🎯 Expected Impact:
##    If this bundle converts even 20% of applicable carts,
##    you could influence ~ 11  transactions.
## 
##     lhs                rhs      support   confidence coverage  lift     count
## [1] {Cheese, Onion} => {Butter} 0.1163793 0.6067416  0.1918103 1.617978 54   
## ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
## BUNDLE RECOMMENDATION # 3 
## ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
## 
## 📦 Bundle Name Suggestion:
##    'Cheese,Onion + Butter Value Pack'
## 
## 🛒 Bundle Components:
##    Base Items:  Cheese,Onion 
##    Add-on Item:  Butter 
## 
## 📊 Performance Metrics:
##    • Lift: 1.62 (61.8% stronger than random)
##    • Confidence: 60.7% (customers who buy base items also buy add-on)
##    • Support: 11.64% (occurs in 54 transactions)
## 
## 💡 Marketing Strategy:
##    RECOMMENDED - Strong complementary relationship
##    → Feature as 'Frequently Bought Together'
##    → Offer 5-10% bundle discount
##    → Add to product recommendation widgets
## 
## 🎯 Expected Impact:
##    If this bundle converts even 20% of applicable carts,
##    you could influence ~ 11  transactions.
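
The figures in the three recommendation cards can be cross-checked from the rule metrics alone. The sketch below copies support, confidence, lift, and count from the output above and recomputes the derived numbers. The 20% conversion rate is the assumption stated in the cards themselves, and the column names are illustrative.

```r
# Rule metrics copied from the recommendation output above
recs <- data.frame(
  rule       = c("{Bacon, Cheese} => {Butter}",
                 "{Bacon, Sugar} => {Meat}",
                 "{Cheese, Onion} => {Butter}"),
  support    = c(0.137931, 0.1185345, 0.1163793),
  confidence = c(0.6153846, 0.6321839, 0.6067416),
  lift       = c(1.641026, 1.62963, 1.617978),
  count      = c(64, 55, 54)
)

# Total number of transactions implied by count / support
recs$n_transactions <- round(recs$count / recs$support)

# Lift = confidence / support(rhs), so the add-on item's own support is:
recs$rhs_support <- round(recs$confidence / recs$lift, 3)

# "x% stronger than random" is (lift - 1) * 100
recs$pct_above_random <- round((recs$lift - 1) * 100, 1)

# Transactions influenced at the 20% conversion rate assumed in the cards
recs$influenced <- round(0.20 * recs$count)

print(recs[, c("rule", "n_transactions", "rhs_support",
               "pct_above_random", "influenced")])
```

All three rules imply the same 464 transactions, which agrees with the item counts reported in Section 11 (for example, Banana: 208 / 464 ≈ 0.448).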

10. Exporting Results for Business Team

cat("\n=== STEP 9: Exporting Results ===\n")
## 
## === STEP 9: Exporting Results ===
# Convert rules to dataframe
rules_df <- as(bundle_rules_clean, "data.frame")
rules_df <- rules_df[order(-rules_df$lift), ]

# Add business-friendly names
rules_df$bundle_name <- paste0("Bundle_", seq_len(nrow(rules_df)))

# Save to CSV (paths are relative to the working directory; a machine-specific
# setwd() call would fail on any other computer)

dir.create("outputs", showWarnings = FALSE)

write.csv(
  rules_df,
  "outputs/product_bundles_recommendations.csv",
  row.names = FALSE
)


# Create summary report
summary_stats <- data.frame(
  Metric = c("Total Transactions Analyzed",
             "Total Products",
             "Average Basket Size",
             "Rules Generated (Initial)",
             "High-Lift Rules (>1.2)",
             "Final Bundle Recommendations",
             "Top Bundle Lift",
             "Average Confidence of Top 10"),
  Value = c(length(trans),
            length(itemLabels(trans)),
            round(mean(size(trans)), 2),
            length(rules_all),
            length(rules_high_lift),
            length(bundle_rules_clean),
            round(max(quality(bundle_rules_clean)$lift), 2),
            round(mean(quality(head(bundle_rules_clean, 10))$confidence) * 100, 1))
)

write.csv(summary_stats, "analysis_summary.csv", row.names = FALSE)
cat("✓ Saved: analysis_summary.csv\n")
## ✓ Saved: analysis_summary.csv

11. Additional Item-Based Recommendations

cat("\n=== BONUS: Product Affinity Matrix ===\n")
## 
## === BONUS: Product Affinity Matrix ===
# Cross-tabulation of products
cross_tab <- crossTable(trans, measure = "lift", sort = TRUE)
cat("\nPairwise Lift Among the 5 Most Frequent Products:\n")
## 
## Pairwise Lift Among the 5 Most Frequent Products:
print(round(cross_tab[1:5, 1:5], 2))
##          Banana Cheese Bacon Hazelnut Honey
## Banana       NA   1.13  1.25     1.18  1.14
## Cheese     1.13     NA  1.17     1.11  1.18
## Bacon      1.25   1.17    NA     1.23  1.13
## Hazelnut   1.18   1.11  1.23       NA  1.10
## Honey      1.14   1.18  1.13     1.10    NA
# Mine frequent itemsets of up to 3 items
# (single items are included; add minlen = 2 to keep only multi-item sets)
freq_itemsets <- eclat(trans, 
                       parameter = list(supp = 0.01, maxlen = 3),
                       control = list(verbose = FALSE))

cat("\nTop 10 Frequent Itemsets (up to 3 products):\n")
## 
## Top 10 Frequent Itemsets (up to 3 products):
inspect(head(sort(freq_itemsets, by = "support"), 10))
##      items         support   count
## [1]  {Banana}      0.4482759 208  
## [2]  {Cheese}      0.4439655 206  
## [3]  {Bacon}       0.4310345 200  
## [4]  {Hazelnut}    0.4202586 195  
## [5]  {HeavyCream}  0.4159483 193  
## [6]  {Honey}       0.4159483 193  
## [7]  {Carrot}      0.4137931 192  
## [8]  {Bread}       0.4073276 189  
## [9]  {ShavingFoam} 0.4051724 188  
## [10] {Apple}       0.4051724 188
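
The pairwise lift values in the affinity matrix follow directly from item and pair supports: lift(A, B) = support(A and B) / (support(A) × support(B)). The base R sketch below reproduces that calculation on a small made-up incidence matrix. It is an illustration of the formula, not the internals of `crossTable()`, and the item names and data are hypothetical.

```r
# Toy 0/1 incidence matrix (made-up data): rows are transactions, columns are items
baskets <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 0, 1,
    0, 1, 1,
    1, 1, 1,
    0, 0, 1,
    1, 1, 0,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("Bacon", "Butter", "Salt"))
)

n <- nrow(baskets)
item_support <- colMeans(baskets)        # share of baskets containing each item
pair_support <- crossprod(baskets) / n   # share of baskets containing both items

# Lift = joint support / product of the individual supports
pair_lift <- pair_support / outer(item_support, item_support)
diag(pair_lift) <- NA                    # self-pairs are not meaningful

print(round(pair_lift, 2))
```

In this toy data, Bacon and Butter co-occur in 4 of 8 baskets while each appears in 5 of 8, giving a lift of 0.5 / (0.625 × 0.625) = 1.28. Values above 1 indicate co-purchase more often than independence would predict.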

12. Hierarchical Clustering of Products

cat("\n=== Product Clustering Analysis ===\n")
## 
## === Product Clustering Analysis ===
# Dissimilarity matrix for products
item_dissim <- dissimilarity(trans, which = "items", method = "jaccard")

# Hierarchical clustering
hc_items <- hclust(item_dissim, method = "ward.D2")

# Plot dendrogram
plot(hc_items, 
     main = "Product Clustering Dendrogram",
     xlab = "Products",
     sub = "Based on Co-occurrence Patterns")
rect.hclust(hc_items, k = 5, border = "red")

cat("\nProducts have been clustered into natural groups.\n")
## 
## Products have been clustered into natural groups.
cat("Recommendation for businesses: use these clusters for category-based bundle strategies\n")
## Recommendation for businesses: use these clusters for category-based bundle strategies

Cutting the dendrogram at k = 5 yields five product clusters. The clustering is built on Jaccard dissimilarity between items: one minus the share of transactions containing both items among those containing either. Products joined at lower linkage heights exhibit stronger co-purchase affinity: these tight sub-branches reflect natural basket companions such as dairy–bakery or meat–pantry combinations, where joint consumption drives repeated pairing.

Broader clusters formed at higher linkage distances represent looser, cross-category co-occurrence, capturing mixed shopping missions that combine grocery staples with household or personal care goods. From a commercial standpoint, the structure confirms the existence of both high-synergy core bundles (tight clusters) and peripheral add-on products (distant branches), supporting category-based merchandising and cluster-driven bundle design.
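
The dendrogram only draws the groups. To list which products fall into each cluster, the tree can be cut into explicit memberships. The sketch below shows the idea in base R on made-up data, using `dist(..., method = "binary")` as a stand-in for the Jaccard dissimilarity computed earlier. The item names and baskets are hypothetical.

```r
# Toy 0/1 incidence matrix (made-up data): rows are transactions, columns are items
baskets <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 1,
    0, 0, 1, 1,
    1, 0, 1, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("Milk", "Bread", "Shampoo", "Toothpaste"))
)

# On the transposed matrix, the "binary" distance is the Jaccard distance
# between items: 1 - (baskets with both) / (baskets with either)
item_dist <- dist(t(baskets), method = "binary")

# Same linkage method as in the analysis above
hc <- hclust(item_dist, method = "ward.D2")

# cutree() converts the dendrogram into explicit cluster memberships
groups <- cutree(hc, k = 2)
print(split(names(groups), groups))
```

Here Milk and Bread share 3 of the 4 baskets in which either appears (distance 0.25), so they merge early, as do Shampoo and Toothpaste. The same `cutree()` call with `k = 5` should work on an `hclust` object such as `hc_items`.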

13. Conclusions & Strategic Business Recommendations

This market basket analysis has revealed several actionable, product-level commercial insights:

  1. Strong Product Associations: Identified 1,149 statistically significant bundle opportunities, with lift values ranging from 1.2 to 1.64, confirming meaningful cross-selling potential beyond random co-occurrence.

  2. Natural Product Clusters: Cutting the hierarchical clustering at k = 5 produced five affinity groups, with the strongest co-purchase ecosystems concentrated around dairy, breakfast staples, meat products, and household essentials.

  3. High-Value Cross-Selling Anchors:
    The most commercially influential bundle drivers include:

    • Butter ↔︎ Bacon ↔︎ Cheese — highest lift, high-frequency profit core
    • Milk ↔︎ Bread ↔︎ Cheese — staple breakfast cluster
    • Heavy Cream ↔︎ Honey ↔︎ Hazelnut — premium / specialty pairing tier
    • Toothpaste ↔︎ Shampoo — cross-category household linkage
    • Banana ↔︎ Apple ↔︎ Bread — high-volume fresh basket drivers

    These products function as basket anchors capable of increasing attachment rates and bundle profitability.


13.1 Implementation Roadmap

Phase 1: Quick Wins (Weeks 1–4)

Focus: High-demand, high-profit staple bundles

  • Deploy Butter + Bacon + Cheese premium breakfast bundle
  • Launch Milk + Bread + Cheese “Daily Essentials Pack”
  • Promote Banana + Apple + Bread value fruit-and-bakery bundle
  • Feature Heavy Cream + Honey as a premium add-on pairing
  • Position bundles in high-traffic store zones and homepage banners

Phase 2: Strategic Expansion (Months 2–3)

Focus: Cross-category and specialty margin expansion

  • Introduce Hazelnut + Honey + Heavy Cream specialty dessert bundle
  • Create Toothpaste + Shampoo personal care multipacks
  • Cross-merchandise Meat + Cheese + Butter for meal preparation baskets
  • Reorganize shelves to reflect dairy–bakery and meat–pantry clusters
  • Apply targeted discounts (5–10%) to mid-tier lift bundles

Phase 3: Optimization & Scale (Months 4–6)

Focus: Testing and profit maximisation

  • Trial niche bundles such as:
    • Carrot + Hazelnut specialty cooking pairings
    • Shaving Foam + Shampoo grooming kits
  • Refine bundles based on conversion and margin contribution
  • Integrate affinity bundles into recommendation engines
  • Align supply chain planning with high-lift product pairings

Measuring Success

Key Performance Indicators (KPIs) to track:

  • Conversion Rate — Uptake of bundles such as Butter–Bacon–Cheese
  • Average Basket Value — Uplift from staple and premium bundles
  • Attachment Rate — % of Milk buyers also purchasing Bread/Cheese
  • Incremental Revenue — Margin added by cross-selling dairy & meat anchors
  • Inventory Turnover — Improved rotation of bundled goods
  • Customer Satisfaction — Perceived convenience of curated bundles
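
Of these KPIs, the attachment rate is the one most directly computable from transaction data. A minimal base R sketch follows, assuming baskets are available as a list of character vectors. The function name `attachment_rate` and the sample baskets are hypothetical.

```r
# Made-up baskets for illustration; each element is one transaction
baskets <- list(
  c("Milk", "Bread"),
  c("Milk", "Cheese", "Butter"),
  c("Milk"),
  c("Bread", "Cheese"),
  c("Milk", "Bread", "Cheese")
)

# Share of baskets containing `anchor` that also contain at least one of `addons`
attachment_rate <- function(baskets, anchor, addons) {
  has_anchor <- vapply(baskets, function(b) anchor %in% b, logical(1))
  has_addon  <- vapply(baskets, function(b) any(addons %in% b), logical(1))
  if (!any(has_anchor)) return(NA_real_)  # anchor never purchased
  sum(has_anchor & has_addon) / sum(has_anchor)
}

milk_rate <- attachment_rate(baskets, "Milk", c("Bread", "Cheese"))
print(milk_rate)
```

With a single add-on item, this quantity equals the confidence of the rule {anchor} → {add-on}. Tracking it before and after a bundle launch gives a like-for-like comparison with the mined rules.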

14. Limitations

Current Limitations:

  • Analysis is based on historical data; patterns may evolve over time
  • Seasonal variations not explicitly modeled
  • Customer segmentation not incorporated (all transactions treated equally)
  • Price sensitivity not analyzed

15. Recommendations for Future Research

Recommended Future Analysis:

  1. Temporal Analysis: Identify seasonal patterns and evolving trends
  2. Customer Segmentation: Develop persona-specific bundle strategies
  3. Price Optimization: Determine optimal bundle discount levels
  4. Causal Analysis: Distinguish correlation from causation using experimental data
  5. Competitive Analysis: Benchmark against industry standards

References

  • Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Proceedings of the 20th VLDB Conference, 487-499.

  • Hahsler, M., Grün, B., & Hornik, K. (2005). arules – A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15), 1-25.

  • Hahsler, M., & Chelluboina, S. (2011). Visualizing association rules: Introduction to the R-extension package arulesViz. R Project Module, 223-238.