Grocery Basket Analysis

Objectives

The arrangement of products in grocery stores is strategically designed, not randomly, to encourage the purchase of related items or prompt consumers to buy frequently forgotten products. This arrangement not only boosts sales but also enhances convenience, allowing customers to collect all their necessary items within a specific area, reducing the need for additional trips. Through this grocery basket analysis, we gain valuable insights into what consumers typically purchase alongside other items, providing a deeper understanding of purchasing behaviors and preferences.

# Load required packages
library(arules)
library(arulesViz)
library(ggplot2)
library(dplyr)

Data Import & Preparation

# Load transaction data
trans1 <- read.transactions("Grocery Products Purchase.csv", format="basket", sep=",", skip=0)
head(trans1)
## transactions in sparse format with
##  6 transactions (rows) and
##  201 items (columns)
summary(trans1)
## transactions as itemMatrix in sparse format with
##  9836 rows (elements/itemsets/transactions) and
##  201 columns (items) and a density of 0.02195155 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34087 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.412   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
length(trans1)
## [1] 9836

Interpretation: The most frequent items in the dataset are: Whole milk (2513 occurrences), Other vegetables (1903 occurrences), Rolls/buns (1809 occurrences), Soda (1715 occurrences), Yogurt (1372 occurrences) and Other (34087 occurrences).

Exploratory Data Analysis

# Frequency of items (support and absolute count)
item_freq <- itemFrequency(trans1, type = "absolute")
item_freq_df <- data.frame(item = names(item_freq), frequency = item_freq) %>% 
  arrange(desc(frequency))
summary(item_freq)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    16.0    78.0   215.9   256.0  2513.0
# Plot top 10 items by frequency
ggplot(item_freq_df[1:10,], aes(x = reorder(item, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  coord_flip() +
  labs(title = "Top 10 Most Frequently Purchased Items", x = "Items", y = "Frequency")

Interpretation: The bar chart highlights the TOP 10 most frequently purchased items, allowing us to identify staple products in customer baskets. For instances, whole milk has the highest frequency of purchase whereas sausage has the lowest frequency of purchase.

Market Basket Analysis Using Apriori Algorithm

# Generate association rules with minimum support = 0.005 and confidence = 0.1
rules <- apriori(trans1, parameter = list(supp = 0.005, conf = 0.1)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5   0.005      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 49 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[201 item(s), 9836 transaction(s)] done [0.00s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [1582 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Remove redundant rules and filter significant rules
rules.clean <- rules[!is.redundant(rules)] 
if (length(rules.clean) > 0) {
  rules.clean <- rules.clean[is.significant(rules.clean, trans1)] 
  rules.clean <- rules.clean[is.maximal(rules.clean)]
} else {
  print("No significant rules generated. Consider lowering support and confidence.")
}

# Inspect top rules by confidence
if (length(rules.clean) > 0) {
  inspect(head(sort(rules.clean, by = "lift", decreasing = TRUE)))
}
##     lhs                    rhs                      support confidence   coverage     lift count
## [1] {ham}               => {white bread}        0.005083367  0.1953125 0.02602684 4.640323    50
## [2] {white bread}       => {ham}                0.005083367  0.1207729 0.04209028 4.640323    50
## [3] {citrus fruit,                                                                              
##      other vegetables,                                                                          
##      whole milk}        => {root vegetables}    0.005795039  0.4453125 0.01301342 4.085908    57
## [4] {butter,                                                                                    
##      other vegetables}  => {whipped/sour cream} 0.005795039  0.2893401 0.02002847 4.036807    57
## [5] {herbs}             => {root vegetables}    0.007015047  0.4312500 0.01626678 3.956880    69
## [6] {other vegetables,                                                                          
##      root vegetables}   => {onions}             0.005693371  0.1201717 0.04737698 3.875438    56

Interpretation: The Apriori algorithm generates rules indicating which products are frequently bought together. Rules with higher lift values suggest stronger associations between items. For instance, ham and white bread have the highest lift, it means that customers are more likely to purchase both together, indicating a potential for cross-promotions.

Filtering & Visualizing Strong Rules

# Filter rules with lift > 1.5
strong_rules <- subset(rules.clean, lift > 1.5)

# Inspect strong rules
if (length(strong_rules) > 0) {
  inspect(head(strong_rules))
} else {
  print("No strong rules found with lift > 1.5.")
}
##     lhs                rhs                support     confidence coverage  
## [1] {cake bar}      => {whole milk}       0.005591704 0.4230769  0.01321675
## [2] {dishes}        => {other vegetables} 0.005998373 0.3410405  0.01758845
## [3] {mustard}       => {whole milk}       0.005185035 0.4322034  0.01199675
## [4] {potted plants} => {whole milk}       0.006913379 0.4000000  0.01728345
## [5] {canned fish}   => {other vegetables} 0.005083367 0.3378378  0.01504677
## [6] {pasta}         => {whole milk}       0.006100041 0.4054054  0.01504677
##     lift     count
## [1] 1.655943 55   
## [2] 1.762729 59   
## [3] 1.691664 51   
## [4] 1.565619 68   
## [5] 1.746176 50   
## [6] 1.586776 60
# Scatterplot of confidence vs lift
plot(rules.clean, method="scatterplot", measure=c("confidence", "lift"), shading="support")

Interpretation: Rules with a lift greater than 1.5 indicate strong positive associations between items. The scatterplot helps visualize the relationship between confidence and lift, highlighting the most impactful rules. In this result, cake bar (LHS) and whole milk (RHS) have the highest lift, which indicates that they have the strongest positive associations.

Jaccard Similarity & Clustering

set.seed(223)
sample_trans <- sample(1:length(trans1), size = 20)
d.jac.t <- dissimilarity(trans1[sample_trans, ], which = "transactions")
d.jac.i <- dissimilarity(trans1[sample_trans, ], which = "items")

# Dendrograms for transactions and items
par(mfrow = c(1,2))
plot(hclust(d.jac.t, method = "ward.D2"), main = "Dendrogram for Sampled Transactions")
plot(hclust(d.jac.i, method = "ward.D2"), main = "Dendrogram for Sampled Items")

par(mfrow = c(1,1))

Interpretation: Jaccard similarity is used to measure how similar transactions and items are. Dendrograms visually represent clusters of similar purchases, which can help in segmentation and marketing strategies.

Basket Size Analysis

# Analyze basket size distribution
basket_size <- size(trans1)
ggplot(data.frame(size = basket_size), aes(x = size)) +
  geom_histogram(binwidth = 1, fill = "purple", alpha = 0.7, color = "black") +
  labs(title = "Distribution of Basket Size", x = "Number of Items in Basket", y = "Frequency") +
  theme_minimal()

# Print summary statistics
cat("Average basket size:", mean(basket_size), "\n")
## Average basket size: 4.412261
cat("Maximum basket size:", max(basket_size), "\n")
## Maximum basket size: 32

Interpretation: This analysis provides insights into how many items customers typically purchase in one transaction, which can be useful for inventory and pricing strategies.

Rule Aggregation & Higher-Level Analysis

# Get the item labels from the transaction data (trans1)
existing_items <- itemLabels(trans1)

# Define item categories based on existing item labels
# Ensure the length of names.real and names.level1 matches the number of existing items
names.real <- c("pork", "margarine", "butter", "citrus fruit", "other vegetables", 
                "root vegetables", "whole milk", "yogurt", "rolls", "beef")
names.level1 <- c("meat", "dairy", "dairy", "fruits", "vegetables", 
                  "vegetables", "dairy", "dairy", "bakery", "meat")

# Ensure these categories match the actual item labels in your dataset
# Create a category mapping for all items (expand as needed)
category_mapping <- data.frame(labels = existing_items, 
                               level1 = rep("uncategorized", length(existing_items)))

# Assign categories based on predefined keywords or manual mappings
category_mapping$level1[grepl("milk|butter|yogurt|cheese", category_mapping$labels)] <- "dairy"
category_mapping$level1[grepl("beef|chicken|pork", category_mapping$labels)] <- "meat"
category_mapping$level1[grepl("apple|banana|citrus|fruit", category_mapping$labels)] <- "fruits"
category_mapping$level1[grepl("vegetable|carrot|broccoli", category_mapping$labels)] <- "vegetables"
category_mapping$level1[grepl("bread|rolls|bakery", category_mapping$labels)] <- "bakery"

# Assign the new categories to the transaction data
itemInfo(trans1) <- category_mapping

# Aggregate transactions by the assigned category (level1)
trans1_agg <- aggregate(trans1, by = "level1")

# Run Apriori on the aggregated transactions
rules_agg <- apriori(trans1_agg, parameter = list(supp = 0.02, conf = 0.3))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.02      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 196 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 9836 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [129 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(head(rules_agg))
##     lhs       rhs             support    confidence coverage  lift     count
## [1] {}     => {bakery}        0.30368036 0.3036804  1.0000000 1.000000 2987 
## [2] {}     => {vegetables}    0.34089061 0.3408906  1.0000000 1.000000 3353 
## [3] {}     => {dairy}         0.44825132 0.4482513  1.0000000 1.000000 4409 
## [4] {}     => {uncategorized} 0.88796259 0.8879626  1.0000000 1.000000 8734 
## [5] {meat} => {fruits}        0.04300529 0.3152012  0.1364376 1.448747  423 
## [6] {meat} => {bakery}        0.05073200 0.3718331  0.1364376 1.224423  499
# Visualize aggregated rules
plot(rules_agg, method = "graph", engine = "htmlwidget")

Interpretation: Aggregating items into higher-level categories allows us to observe broader purchasing patterns. This approach is useful when individual items have low support but exhibit strong category-level associations.

  • High Support Items: Items like bakery, vegetables, dairy are frequently bought. These items have high support values (over 30% of transactions contain these items).

  • Relationships Between Categories:

    • meat is associated with both fruits and vegetables, suggesting that these categories might often be purchased together, albeit with lower frequency.

    • bakery is associated with vegetables and dairy, indicating possible shopping basket combinations.

  • Interpretation of Lift:

    • A lift greater than 1 indicates a positive relationship between items. For example, meat => fruits has a lift of 1.45, which suggests that customers who purchase meat are more likely to purchase fruits than would be expected by chance.