Market Basket Analysis - Groceries Dataset

Introduction

This project uses the Groceries dataset to perform Market Basket Analysis. The aim is to discover interesting association rules between purchased items.

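Load Libraries

# Packages used by the code in this report
library(arules)        # read.transactions(), apriori(), inspect(), itemFrequencyPlot()
library(arulesViz)     # plot() methods for rules ("graph", "grouped")
library(dplyr)         # %>% and select()
library(knitr)         # kable()
library(kableExtra)    # kable_styling()
library(RColorBrewer)  # brewer.pal()
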
Load Data

data_path <- "C:/Users/Admin/Downloads/GroceryDataSet.csv"
data <- read.csv(data_path, header = FALSE, stringsAsFactors = FALSE)

# Display summary of raw data
summary(data) %>%
  kable(caption = "Summary of Raw Grocery Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

Summary of Raw Grocery Dataset
Columns	Length	Class	Mode
V1 to V32 (identical for all 32 columns)	9835	character	character

# Load transactions for market basket analysis
trans <- read.transactions(data_path, format = "basket", sep = ",")

# Summary of transaction data
summary(trans)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The summary shows that ‘whole milk’ is the most frequent item, appearing in 2,513 of the 9,835 transactions (about 26%). It is followed by ‘other vegetables’ (1,903), ‘rolls/buns’ (1,809), and ‘soda’ (1,715). These counts suggest that these items are staples in many customer baskets. The median basket holds only 3 items (mean 4.4), so most transactions are small. To see the distribution at a glance, I’ll use an item frequency plot of the top purchased items.

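# Bar chart of the 10 most frequent items.
# brewer.pal(8, ...) returns 8 colors, so the last two bars reuse the first two colors.
itemFrequencyPlot(trans,
                  topN = 10,
                  type = "absolute",
                  col = brewer.pal(8, "Pastel1"),
                  main = "Top 10 Items by Absolute Frequency")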

The frequency plot above makes the top items easier to compare than the raw counts. I used the itemFrequencyPlot function, which draws a bar chart directly from the transactions object, with type = "absolute" so that bar heights are transaction counts rather than proportions.
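As a small sketch of how the absolute counts relate to relative frequencies (what itemFrequencyPlot would show with type = "relative"), the counts from the summary output can be divided by the number of transactions. The variable names here are illustrative only and are not part of the original analysis.

```r
# Counts copied from the summary(trans) output above
n_trans  <- 9835
abs_freq <- c("whole milk"       = 2513,
              "other vegetables" = 1903,
              "rolls/buns"       = 1809,
              "soda"             = 1715,
              "yogurt"           = 1372)

# Relative frequency = share of transactions that contain the item
rel_freq <- round(abs_freq / n_trans, 4)
print(rel_freq)
```

Under these numbers, whole milk appears in roughly a quarter of all baskets.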

Train/Extract

To extract meaningful association rules, I ran the Apriori algorithm with minimum thresholds for support and confidence. These thresholds filter out rare or unreliable combinations and keep rules that reflect patterns with a higher likelihood of co-occurrence. Here, support is the fraction of transactions that contain an item or itemset, while confidence is the fraction of transactions containing the left-hand side of a rule that also contain its right-hand side.
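To make the two definitions concrete, here is a minimal base R sketch on a made-up set of five baskets. The data and the helper functions support() and confidence() are hypothetical illustrations, not part of arules or of the Groceries file.

```r
# Toy baskets (hypothetical data, not taken from the Groceries dataset)
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "other vegetables"),
  c("whole milk", "yogurt", "other vegetables"),
  c("soda"),
  c("yogurt")
)

# Support: share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of lhs => rhs: support(lhs and rhs together) / support(lhs)
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

# {whole milk, yogurt} occurs in 2 of 5 baskets
supp_milk_yogurt <- support(c("whole milk", "yogurt"), baskets)

# 2 of the 3 baskets with whole milk also contain yogurt
conf_milk_yogurt <- confidence("whole milk", "yogurt", baskets)

print(c(support = supp_milk_yogurt, confidence = conf_milk_yogurt))
```

In this toy case the rule {whole milk} => {yogurt} would pass a confidence threshold of 0.5, the value used in this analysis.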

# Set minimum support as a fraction: an itemset must appear in at least 42 of the 9,835 transactions
min_support <- 42 / length(trans)
min_support

## [1] 0.004270463

Generate Association Rules

# Generate rules using apriori
rules <- apriori(trans, parameter = list(supp = min_support, conf = 0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime     support minlen
##         0.5    0.1    1 none FALSE            TRUE       5 0.004270463      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 42 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [124 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [177 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

# Sort by lift
rules <- sort(rules, by = "lift", decreasing = TRUE)

# View top 10 rules
top_rules <- head(rules, 10)
inspect(top_rules)

##      lhs                      rhs                    support confidence    coverage     lift count
## [1]  {citrus fruit,                                                                               
##       root vegetables,                                                                            
##       tropical fruit}      => {other vegetables} 0.004473818  0.7857143 0.005693950 4.060694    44
## [2]  {tropical fruit,                                                                             
##       whipped/sour cream,                                                                         
##       whole milk}          => {yogurt}           0.004372140  0.5512821 0.007930859 3.951792    43
## [3]  {curd,                                                                                       
##       tropical fruit}      => {yogurt}           0.005287239  0.5148515 0.010269446 3.690645    52
## [4]  {citrus fruit,                                                                               
##       root vegetables,                                                                            
##       whole milk}          => {other vegetables} 0.005795628  0.6333333 0.009150991 3.273165    57
## [5]  {pip fruit,                                                                                  
##       root vegetables,                                                                            
##       whole milk}          => {other vegetables} 0.005490595  0.6136364 0.008947636 3.171368    54
## [6]  {root vegetables,                                                                            
##       tropical fruit,                                                                             
##       yogurt}              => {other vegetables} 0.004982206  0.6125000 0.008134215 3.165495    49
## [7]  {pip fruit,                                                                                  
##       whipped/sour cream}  => {other vegetables} 0.005592272  0.6043956 0.009252669 3.123610    55
## [8]  {onions,                                                                                     
##       root vegetables}     => {other vegetables} 0.005693950  0.6021505 0.009456024 3.112008    56
## [9]  {cream cheese,                                                                               
##       root vegetables}     => {other vegetables} 0.004473818  0.5945946 0.007524148 3.072957    44
## [10] {beef,                                                                                       
##       tropical fruit}      => {other vegetables} 0.004473818  0.5866667 0.007625826 3.031985    44

The table above displays the top 10 association rules found by the Apriori algorithm, ranked by lift. Lift measures how much more often the items in a rule appear together than would be expected if they occurred independently; it equals the rule's confidence divided by the support of the right-hand side. A lift greater than 1 indicates a positive association: the presence of the left-hand-side items increases the likelihood of seeing the right-hand-side item(s) in the same transaction. The top rule, for example, has a lift of about 4.06, although it is backed by only 44 transactions, so the highest-lift rules here are also fairly rare.
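As a quick check of how lift is derived, the sketch below recomputes the lift of rule [1] from numbers already printed in this report: the rule's confidence and count from the inspect() output, and the 'other vegetables' count from the transaction summary. The variable names are illustrative only.

```r
n_trans      <- 9835        # total transactions (from summary(trans))
count_rule   <- 44          # baskets containing LHS and RHS of rule [1]
count_rhs    <- 1903        # baskets containing "other vegetables"
confidence_1 <- 0.7857143   # confidence reported for rule [1]

# Support of the right-hand side on its own
supp_rhs <- count_rhs / n_trans

# Lift = confidence / support(RHS)
lift_1 <- confidence_1 / supp_rhs

# Number of baskets containing the LHS, recovered from count / confidence;
# dividing by n_trans gives the "coverage" column
count_lhs <- round(count_rule / confidence_1)
coverage_1 <- count_lhs / n_trans

print(c(lift = lift_1, coverage = coverage_1))
```

The result agrees with the reported lift of about 4.06 and coverage of about 0.0057, up to rounding of the printed confidence.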

Rule Metrics Table

# Convert rules to a data frame for tabular view
rules_df <- as(rules, "data.frame")

# Show support, confidence, and lift for the top 10 rules (already sorted by lift)
rules_df %>%
  head(10) %>%
  select(support, confidence, lift) %>%
  kable(caption = "Top 10 Rules by Lift: Support, Confidence, and Lift") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

Top 10 Rules by Lift: Support, Confidence, and Lift
	support	confidence	lift
157	0.0044738	0.7857143	4.060694
144	0.0043721	0.5512821	3.951792
46	0.0052872	0.5148515	3.690645
161	0.0057956	0.6333333	3.273165
154	0.0054906	0.6136364	3.171368
163	0.0049822	0.6125000	3.165495
101	0.0055923	0.6043956	3.123610
11	0.0056940	0.6021505	3.112008
21	0.0044738	0.5945946	3.072957
37	0.0044738	0.5866667	3.031985

Visualize Association Rules

# Visualize the top rules
plot(top_rules, method = "graph", engine = "htmlwidget")

plot(top_rules, method = "grouped")

The grouped matrix visualization above shows how the left-hand-side itemsets of the rules are associated with items on the right-hand side. Each balloon represents a rule (or a group of rules with similar left-hand sides), with its size indicating support and its color intensity indicating lift. This view makes it easier to spot which item combinations lead to a specific product and how strong those associations are relative to independence.

Extra Credit: Simple Cluster Analysis

# Convert transactions to a binary matrix
item_matrix <- as(trans, "matrix")

# Select top 50 most frequent items
top_items <- sort(colSums(item_matrix), decreasing = TRUE)
top_item_matrix <- item_matrix[, names(top_items)[1:50]]

# Apply k-means clustering with 3 clusters
set.seed(123)
clust <- kmeans(top_item_matrix, centers = 3)

# View cluster distribution
table(clust$cluster)

## 
##    1    2    3 
## 2300 5030 2505

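The clustering splits the 9,835 transactions into three segments of 2,300, 5,030, and 2,505 baskets based on the 50 most frequent items. Before acting on these segments, the cluster centers (clust$centers) should be inspected to see which items characterize each group; once profiled, the segments could help a retailer target specific types of shopping behavior with promotions or personalized recommendations.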
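The following is a minimal sketch of how cluster centers can be read, using a small made-up 0/1 basket matrix rather than the Groceries data. The matrix toy, the two-cluster choice, and nstart = 10 are assumptions for illustration and differ from the settings used above.

```r
set.seed(123)

# Hypothetical 0/1 basket matrix: 20 "dairy" baskets and 30 "snack" baskets
toy <- rbind(
  matrix(c(1, 1, 0, 0), nrow = 15, ncol = 4, byrow = TRUE),
  matrix(c(1, 0, 0, 0), nrow = 5,  ncol = 4, byrow = TRUE),
  matrix(c(0, 0, 1, 1), nrow = 20, ncol = 4, byrow = TRUE),
  matrix(c(0, 0, 1, 0), nrow = 10, ncol = 4, byrow = TRUE)
)
colnames(toy) <- c("whole milk", "yogurt", "soda", "rolls/buns")

# k-means with several random starts to reduce sensitivity to initialization
fit <- kmeans(toy, centers = 2, nstart = 10)

# Cluster sizes, sorted so the result does not depend on cluster labels
sizes <- sort(as.integer(table(fit$cluster)))
print(sizes)

# Each center entry is the share of baskets in that cluster containing the item
print(round(fit$centers, 2))
```

For 0/1 data, each entry of a center is the proportion of the cluster's baskets that contain that item, which is what makes the centers usable as a segment profile.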

Conclusion

This market basket analysis revealed useful patterns in the grocery transaction data. Whole milk, other vegetables, rolls/buns, soda, and yogurt were the most frequently purchased items. With a minimum support of 42 transactions and a minimum confidence of 0.5, the Apriori algorithm produced 177 rules; the strongest by lift (up to about 4.06) mostly link combinations of fruit, root vegetables, and dairy to other vegetables or yogurt. The additional clustering step divided transactions into three segments which, once profiled, could support more targeted inventory decisions, personalized promotions, and strategic product placement.