1 Introduction

The objective of this analysis is to implement Market Basket Analysis (MBA) using the arules package in R. We explore a dataset containing 9,835 grocery transactions to identify frequent itemsets and generate association rules. This technique allows us to understand customer behavior by identifying products that are frequently purchased together.

2 Data Preparation and Exploratory Analysis

Before mining rules, we must transform the raw CSV data into a formal transactions object. This format is optimized for the Apriori algorithm, treating each row as a discrete shopping event.

library(arules)
library(arulesViz)
library(ggplot2)

# Loading the dataset
groceries <- read.transactions("GroceryDataSet.csv", format = "basket", sep = ",")

# Visualizing the top 10 most frequent items
itemFrequencyPlot(groceries, topN = 10, type = "relative", 
                  col = "steelblue", main = "Relative Item Frequency")

Preliminary exploration shows that “whole milk” and “other vegetables” are the most frequent items. This suggests that basic dairy and produce appear in a large share of baskets in this retail environment, although item frequency alone does not show what brings customers into the store.
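To make the transactions format concrete, the following is a minimal sketch on a handful of made-up baskets rather than the grocery file. The object names `toy_baskets`, `toy_trans` and `item_freq` are illustrative assumptions and are not part of the original analysis. It shows how a list of baskets is coerced to a transactions object and how the relative item frequencies behind the plot can be read as numbers.

```r
library(arules)

# Hypothetical toy baskets: one character vector per shopping event
toy_baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "other vegetables"),
  c("other vegetables", "rolls/buns", "whole milk"),
  c("soda")
)

# Coerce the list to a sparse transactions object
toy_trans <- as(toy_baskets, "transactions")

# Relative frequency = share of baskets that contain each item
item_freq <- sort(itemFrequency(toy_trans, type = "relative"), decreasing = TRUE)
print(item_freq)
```

Applied to the real `groceries` object, the same `itemFrequency()` call would return the values that `itemFrequencyPlot()` draws.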

3 Association Rule Mining (Apriori Algorithm)

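To find meaningful associations, we set a minimum support of 0.005, so that the itemset behind a rule must appear in at least 0.5% of transactions (an absolute minimum count of 49 out of 9,835, as reported in the output below), and a minimum confidence of 0.5.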

# Applying the Apriori algorithm
rules <- apriori(groceries, parameter = list(supp = 0.005, conf = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 49 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [120 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Sorting rules by Lift to identify the strongest associations
top_10_lift <- sort(rules, by = "lift", decreasing = TRUE)[1:10]

3.1 Top 10 Rules by Lift

The following table presents the strongest rules based on the Lift metric. Lift is the ratio of a rule's confidence to the baseline support of its right-hand side, so it measures how much more likely item B is to be purchased when item A is present than in a randomly chosen basket.

inspect(top_10_lift)
##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {curd,                                                                                      
##       tropical fruit}     => {yogurt}           0.005287239  0.5148515 0.010269446 3.690645    52
## [2]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.005795628  0.6333333 0.009150991 3.273165    57
## [3]  {pip fruit,                                                                                 
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.005490595  0.6136364 0.008947636 3.171368    54
## [4]  {pip fruit,                                                                                 
##       whipped/sour cream} => {other vegetables} 0.005592272  0.6043956 0.009252669 3.123610    55
## [5]  {onions,                                                                                    
##       root vegetables}    => {other vegetables} 0.005693950  0.6021505 0.009456024 3.112008    56
## [6]  {citrus fruit,                                                                              
##       root vegetables}    => {other vegetables} 0.010371124  0.5862069 0.017691917 3.029608   102
## [7]  {root vegetables,                                                                           
##       tropical fruit,                                                                            
##       whole milk}         => {other vegetables} 0.007015760  0.5847458 0.011997966 3.022057    69
## [8]  {root vegetables,                                                                           
##       tropical fruit}     => {other vegetables} 0.012302999  0.5845411 0.021047280 3.020999   121
## [9]  {butter,                                                                                    
##       whipped/sour cream} => {other vegetables} 0.005795628  0.5700000 0.010167768 2.945849    57
## [10] {tropical fruit,                                                                            
##       whipped/sour cream} => {other vegetables} 0.007829181  0.5661765 0.013828165 2.926088    77

3.1.1 Discussion:

The strongest rule is {curd, tropical fruit} => {yogurt}, with a Lift of 3.69. This indicates that a basket containing curd and tropical fruit is about 3.7 times as likely to contain yogurt as an average basket.
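As a sanity check, the metrics for this rule can be recomputed by hand from the values printed in the table. The sketch below uses only those printed numbers; the variable names are illustrative, and the yogurt support is not reported in the output, so it is backed out from the relationship lift = confidence / support(rhs).

```r
# Values copied from row [1] of the inspect() output
n_transactions <- 9835         # total transactions in the dataset
count_rule     <- 52           # baskets with curd, tropical fruit and yogurt
coverage_lhs   <- 0.010269446  # share of baskets with {curd, tropical fruit}
lift_reported  <- 3.690645

# Support of the whole rule: share of all baskets containing every item
support_rule <- count_rule / n_transactions

# Confidence: share of {curd, tropical fruit} baskets that also contain yogurt
confidence <- support_rule / coverage_lhs

# Lift = confidence / support(rhs), so the implied baseline yogurt support is
support_rhs <- confidence / lift_reported

round(c(support = support_rule,
        confidence = confidence,
        support_yogurt = support_rhs), 4)
```

The results are roughly 0.0053, 0.5149 and 0.1395: yogurt appears in about 14% of all baskets but in about 51% of baskets that contain curd and tropical fruit, and the ratio of the two is the reported Lift of 3.69.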

We also observe a dominant pattern in which “other vegetables” appears as the consequent: nine of the ten rules have it on the right-hand side. Specifically, combinations of root vegetables with onions, citrus fruit, or tropical fruit all yield Lift values above 3.0, as does the pair {pip fruit, whipped/sour cream}. For instance, the association {onions, root vegetables} => {other vegetables} shows a confidence of 60.2%, meaning that in roughly 6 out of 10 cases where onions and root vegetables are bought, “other vegetables” are also present in the basket.
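The “other vegetables” pattern can also be isolated programmatically instead of being read off the table. The following is a sketch under one loud assumption: it uses the `Groceries` data bundled with arules as a stand-in for `GroceryDataSet.csv`, which we have not verified to be identical. The name `veg_rules` is illustrative.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv
data("Groceries")

# Same thresholds as in the analysis; verbose output suppressed
rules <- apriori(Groceries,
                 parameter = list(supp = 0.005, conf = 0.5),
                 control = list(verbose = FALSE))

# Keep only rules whose consequent is "other vegetables" and whose lift exceeds 3
veg_rules <- subset(rules, subset = rhs %in% "other vegetables" & lift > 3)

# Show them with the strongest association first
inspect(sort(veg_rules, by = "lift"))
```

In the session used for this report, the same `subset()` call could be applied directly to the existing `rules` object.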

4 Cluster Analysis

To complement the MBA, we applied k-means clustering with three clusters to the binary transaction matrix (one row per transaction, one column per item) to segment shopping behaviors.

# Convert to binary matrix
transactions_matrix <- as(groceries, "matrix")

# K-means with 3 clusters
set.seed(42)
km_clusters <- kmeans(transactions_matrix, centers = 3)

# Summary of clusters
table(km_clusters$cluster)
## 
##    1    2    3 
## 6559  954 2322

4.1 Discussion of Clustering Results:

The segmentation produced three groups of markedly different sizes. The labels below are tentative interpretations, since the cluster contents were not profiled in this analysis:

Cluster 1 (6,559 transactions): This is the largest group, likely representing “target-specific” or “small-basket” shoppers who purchase only a few items per visit.

Cluster 2 (954 transactions): The smallest group, which may represent “bulk” or “heavy” shoppers, that is, transactions with a high density of items across multiple categories.

Cluster 3 (2,322 transactions): A mid-sized group that likely captures “routine” or “mid-size” shopping trips, such as weekly replenishment of staples.
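The cluster descriptions above could be checked by comparing the average basket size in each cluster. The following is a minimal sketch on simulated data, not on the grocery matrix; `toy_matrix` and `cluster_profile` are hypothetical names, and the simulated baskets (200 sparse, 100 dense) are an assumption made only so that the example runs on its own.

```r
# Simulated stand-in for the binary transaction matrix (NOT the grocery data)
set.seed(42)
n_transactions <- 300
n_items <- 20

# Row-wise inclusion probability: first 200 baskets sparse, last 100 dense
p <- rep(c(0.1, 0.5), times = c(200, 100))
toy_matrix <- matrix(rbinom(n_transactions * n_items, size = 1, prob = p),
                     nrow = n_transactions, ncol = n_items)

# K-means with 3 clusters, several random starts for stability
km <- kmeans(toy_matrix, centers = 3, nstart = 10)

# Basket size = number of distinct items in each transaction
basket_size <- rowSums(toy_matrix)

# Profile each cluster by its size and its mean basket size
cluster_profile <- data.frame(
  cluster    = sort(unique(km$cluster)),
  n          = as.vector(table(km$cluster)),
  mean_items = as.vector(tapply(basket_size, km$cluster, mean))
)
print(cluster_profile)
```

On the real data, `rowSums(transactions_matrix)` would supply the basket sizes and `km_clusters$cluster` the labels. A clearly larger `mean_items` for Cluster 2 would support the “bulk shopper” reading, while similar means across clusters would argue against it.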

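Combining clustering with association rules suggests a hypothesis worth testing: the product links found by the MBA may be driven mainly by the larger baskets that we expect in Cluster 2 and Cluster 3, since multi-item rules can only arise in transactions that contain several items. This was not verified here; doing so would require mining rules within each cluster separately.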

5 Conclusion

This report identified the top 10 association rules for the Groceries dataset, ranked by their Lift values. The strongest was the association between curd, tropical fruit, and yogurt (Lift: 3.69), which is consistent with a breakfast or health-oriented bundle, although the data do not reveal purchase motives. Additionally, “other vegetables” is the consequent in nine of the ten highest-lift rules, which points to it as a staple that anchors multiple shopping patterns.

From a strategic perspective, the clustering results complement these findings: about two thirds of transactions (Cluster 1, 6,559 of 9,835) fall into a single group that we interpret as focused, low-volume trips. If that interpretation holds, retailers could consider placing high-lift item pairs (like onions and root vegetables) in visible, accessible “proximity zones” to capture the interest of these quick-trip shoppers, while using the anchor items identified in the rules to guide the more extensive shopping trips that Clusters 2 and 3 may represent.