The arrangement of products in grocery stores is strategically designed, not randomly, to encourage the purchase of related items or prompt consumers to buy frequently forgotten products. This arrangement not only boosts sales but also enhances convenience, allowing customers to collect all their necessary items within a specific area, reducing the need for additional trips. Through this grocery basket analysis, we gain valuable insights into what consumers typically purchase alongside other items, providing a deeper understanding of purchasing behaviors and preferences.
# Load transaction data
trans1 <- read.transactions("Grocery Products Purchase.csv", format="basket", sep=",", skip=0)
head(trans1)## transactions in sparse format with
## 6 transactions (rows) and
## 201 items (columns)
## transactions as itemMatrix in sparse format with
## 9836 rows (elements/itemsets/transactions) and
## 201 columns (items) and a density of 0.02195155
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34087
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.412 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
## [1] 9836
Interpretation: The most frequent items in the dataset are: Whole milk (2513 occurrences), Other vegetables (1903 occurrences), Rolls/buns (1809 occurrences), Soda (1715 occurrences), Yogurt (1372 occurrences) and Other (34087 occurrences).
# Frequency of items (support and absolute count)
item_freq <- itemFrequency(trans1, type = "absolute")
item_freq_df <- data.frame(item = names(item_freq), frequency = item_freq) %>%
arrange(desc(frequency))
summary(item_freq)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 16.0 78.0 215.9 256.0 2513.0
# Plot top 10 items by frequency
ggplot(item_freq_df[1:10,], aes(x = reorder(item, frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "lightblue") +
coord_flip() +
labs(title = "Top 10 Most Frequently Purchased Items", x = "Items", y = "Frequency")Interpretation: The bar chart highlights the TOP 10 most frequently purchased items, allowing us to identify staple products in customer baskets. For instances, whole milk has the highest frequency of purchase whereas sausage has the lowest frequency of purchase.
# Generate association rules with minimum support = 0.005 and confidence = 0.1
rules <- apriori(trans1, parameter = list(supp = 0.005, conf = 0.1)) ## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 49
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[201 item(s), 9836 transaction(s)] done [0.00s].
## sorting and recoding items ... [120 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [1582 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Remove redundant rules and filter significant rules
rules.clean <- rules[!is.redundant(rules)]
if (length(rules.clean) > 0) {
rules.clean <- rules.clean[is.significant(rules.clean, trans1)]
rules.clean <- rules.clean[is.maximal(rules.clean)]
} else {
print("No significant rules generated. Consider lowering support and confidence.")
}
# Inspect top rules by confidence
if (length(rules.clean) > 0) {
inspect(head(sort(rules.clean, by = "lift", decreasing = TRUE)))
}## lhs rhs support confidence coverage lift count
## [1] {ham} => {white bread} 0.005083367 0.1953125 0.02602684 4.640323 50
## [2] {white bread} => {ham} 0.005083367 0.1207729 0.04209028 4.640323 50
## [3] {citrus fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.005795039 0.4453125 0.01301342 4.085908 57
## [4] {butter,
## other vegetables} => {whipped/sour cream} 0.005795039 0.2893401 0.02002847 4.036807 57
## [5] {herbs} => {root vegetables} 0.007015047 0.4312500 0.01626678 3.956880 69
## [6] {other vegetables,
## root vegetables} => {onions} 0.005693371 0.1201717 0.04737698 3.875438 56
Interpretation: The Apriori algorithm generates rules indicating which products are frequently bought together. Rules with higher lift values suggest stronger associations between items. For instance, ham and white bread have the highest lift, it means that customers are more likely to purchase both together, indicating a potential for cross-promotions.
# Filter rules with lift > 1.5
strong_rules <- subset(rules.clean, lift > 1.5)
# Inspect strong rules
if (length(strong_rules) > 0) {
inspect(head(strong_rules))
} else {
print("No strong rules found with lift > 1.5.")
}## lhs rhs support confidence coverage
## [1] {cake bar} => {whole milk} 0.005591704 0.4230769 0.01321675
## [2] {dishes} => {other vegetables} 0.005998373 0.3410405 0.01758845
## [3] {mustard} => {whole milk} 0.005185035 0.4322034 0.01199675
## [4] {potted plants} => {whole milk} 0.006913379 0.4000000 0.01728345
## [5] {canned fish} => {other vegetables} 0.005083367 0.3378378 0.01504677
## [6] {pasta} => {whole milk} 0.006100041 0.4054054 0.01504677
## lift count
## [1] 1.655943 55
## [2] 1.762729 59
## [3] 1.691664 51
## [4] 1.565619 68
## [5] 1.746176 50
## [6] 1.586776 60
# Scatterplot of confidence vs lift
plot(rules.clean, method="scatterplot", measure=c("confidence", "lift"), shading="support")Interpretation: Rules with a lift greater than 1.5 indicate strong positive associations between items. The scatterplot helps visualize the relationship between confidence and lift, highlighting the most impactful rules. In this result, cake bar (LHS) and whole milk (RHS) have the highest lift, which indicates that they have the strongest positive associations.
set.seed(223)
sample_trans <- sample(1:length(trans1), size = 20)
d.jac.t <- dissimilarity(trans1[sample_trans, ], which = "transactions")
d.jac.i <- dissimilarity(trans1[sample_trans, ], which = "items")
# Dendrograms for transactions and items
par(mfrow = c(1,2))
plot(hclust(d.jac.t, method = "ward.D2"), main = "Dendrogram for Sampled Transactions")
plot(hclust(d.jac.i, method = "ward.D2"), main = "Dendrogram for Sampled Items")Interpretation: Jaccard similarity is used to measure how similar transactions and items are. Dendrograms visually represent clusters of similar purchases, which can help in segmentation and marketing strategies.
# Analyze basket size distribution
basket_size <- size(trans1)
ggplot(data.frame(size = basket_size), aes(x = size)) +
geom_histogram(binwidth = 1, fill = "purple", alpha = 0.7, color = "black") +
labs(title = "Distribution of Basket Size", x = "Number of Items in Basket", y = "Frequency") +
theme_minimal()## Average basket size: 4.412261
## Maximum basket size: 32
Interpretation: This analysis provides insights into how many items customers typically purchase in one transaction, which can be useful for inventory and pricing strategies.
# Get the item labels from the transaction data (trans1)
existing_items <- itemLabels(trans1)
# Define item categories based on existing item labels
# Ensure the length of names.real and names.level1 matches the number of existing items
names.real <- c("pork", "margarine", "butter", "citrus fruit", "other vegetables",
"root vegetables", "whole milk", "yogurt", "rolls", "beef")
names.level1 <- c("meat", "dairy", "dairy", "fruits", "vegetables",
"vegetables", "dairy", "dairy", "bakery", "meat")
# Ensure these categories match the actual item labels in your dataset
# Create a category mapping for all items (expand as needed)
category_mapping <- data.frame(labels = existing_items,
level1 = rep("uncategorized", length(existing_items)))
# Assign categories based on predefined keywords or manual mappings
category_mapping$level1[grepl("milk|butter|yogurt|cheese", category_mapping$labels)] <- "dairy"
category_mapping$level1[grepl("beef|chicken|pork", category_mapping$labels)] <- "meat"
category_mapping$level1[grepl("apple|banana|citrus|fruit", category_mapping$labels)] <- "fruits"
category_mapping$level1[grepl("vegetable|carrot|broccoli", category_mapping$labels)] <- "vegetables"
category_mapping$level1[grepl("bread|rolls|bakery", category_mapping$labels)] <- "bakery"
# Assign the new categories to the transaction data
itemInfo(trans1) <- category_mapping
# Aggregate transactions by the assigned category (level1)
trans1_agg <- aggregate(trans1, by = "level1")
# Run Apriori on the aggregated transactions
rules_agg <- apriori(trans1_agg, parameter = list(supp = 0.02, conf = 0.3))## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.02 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 196
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[6 item(s), 9836 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [129 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## lhs rhs support confidence coverage lift count
## [1] {} => {bakery} 0.30368036 0.3036804 1.0000000 1.000000 2987
## [2] {} => {vegetables} 0.34089061 0.3408906 1.0000000 1.000000 3353
## [3] {} => {dairy} 0.44825132 0.4482513 1.0000000 1.000000 4409
## [4] {} => {uncategorized} 0.88796259 0.8879626 1.0000000 1.000000 8734
## [5] {meat} => {fruits} 0.04300529 0.3152012 0.1364376 1.448747 423
## [6] {meat} => {bakery} 0.05073200 0.3718331 0.1364376 1.224423 499
Interpretation: Aggregating items into higher-level categories allows us to observe broader purchasing patterns. This approach is useful when individual items have low support but exhibit strong category-level associations.
High Support Items: Items like bakery, vegetables, dairy are frequently bought. These items have high support values (over 30% of transactions contain these items).
Relationships Between Categories:
meat is associated with both fruits and vegetables, suggesting that these categories might often be purchased together, albeit with lower frequency.
bakery is associated with vegetables and dairy, indicating possible shopping basket combinations.
Interpretation of Lift:
meat => fruits has a
lift of 1.45, which suggests that customers who purchase meat are more
likely to purchase fruits than would be expected by chance.