Market Basket Analysis: Groceries Dataset

Overview

This assignment uses the Groceries dataset to perform Market Basket Analysis. Each row in the data represents one customer transaction, and the items on that receipt are listed across the columns of the row, so shorter baskets leave the trailing columns empty. The goal is to find association rules that show which grocery items are often purchased together.

For the association rule mining, I report the three main measures:

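Support: the proportion of all transactions that contain every item in the rule (left-hand side and right-hand side together).
Confidence: the proportion of transactions containing the left-hand side item(s) that also contain the right-hand side item(s).
Lift: the rule's confidence divided by the overall proportion of transactions that contain the right-hand side item(s). It shows how much more likely the right-hand side is to be purchased when the left-hand side is purchased, compared with its baseline purchase rate.

A lift value greater than 1 means the items appear together more often than would be expected if they were purchased independently of each other.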
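
The three measures can be checked by hand on a tiny example. The sketch below uses five made-up baskets (not taken from the Groceries data) and a hypothetical helper, rule_measures(), written in base R. It is meant only to illustrate the definitions above, not to replace apriori().

```r
# Five made-up baskets, for illustration only (not from the Groceries data).
toy_baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk", "soda"),
  c("bread", "milk", "soda")
)

# Hypothetical helper: support, confidence, and lift for the rule lhs => rhs.
rule_measures <- function(baskets, lhs, rhs) {
  has_lhs <- vapply(baskets, function(b) all(lhs %in% b), logical(1))
  has_rhs <- vapply(baskets, function(b) all(rhs %in% b), logical(1))

  support    <- mean(has_lhs & has_rhs)     # share of baskets containing both sides
  confidence <- support / mean(has_lhs)     # share of LHS baskets that also contain the RHS
  lift       <- confidence / mean(has_rhs)  # confidence relative to the RHS baseline rate

  c(support = support, confidence = confidence, lift = lift)
}

rule_measures(toy_baskets, lhs = "butter", rhs = "bread")
```

In this toy data, both butter baskets also contain bread, so the confidence of {butter} => {bread} is 1. Bread appears in 4 of the 5 baskets, so the lift is 1 / 0.8 = 1.25.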

Load Packages

# Run these install lines only if the packages are not already installed:
# install.packages("arules")
# install.packages("knitr")

library(arules)
library(knitr)

Load and Clean the Data

# Make sure GroceryDataSet.csv is saved in the same folder as this Rmd file.
raw_data <- read.csv(
  "GroceryDataSet.csv",
  header = FALSE,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA")
)

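# Convert each row into a basket (a character vector of items).
# unlist() flattens the one-row data frame into a plain vector, so missing cells stay NA
# rather than depending on how as.character() treats a data frame row.
# trimws() removes extra spaces, and unique() avoids duplicate items within the same transaction.
basket_list <- lapply(seq_len(nrow(raw_data)), function(i) {
  items <- as.character(unlist(raw_data[i, ], use.names = FALSE))
  items <- trimws(items)
  items <- items[!is.na(items) & items != ""]
  unique(items)
})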

# Convert the basket list into transactions for the arules package.
groceries <- as(basket_list, "transactions")

# Basic dataset information.
num_transactions <- length(groceries)
num_items <- length(itemLabels(groceries))

cat("Number of transactions:", num_transactions, "\n")

## Number of transactions: 9835

cat("Number of unique items:", num_items, "\n")

## Number of unique items: 169

Basic Basket Summary

basket_sizes <- size(groceries)

summary_table <- data.frame(
  Measure = c("Minimum basket size", "Median basket size", "Mean basket size", "Maximum basket size"),
  Value = c(
    min(basket_sizes),
    median(basket_sizes),
    round(mean(basket_sizes), 2),
    max(basket_sizes)
  )
)

kable(summary_table, caption = "Summary of Basket Sizes")

Summary of Basket Sizes
Measure	Value
Minimum basket size	1.00
Median basket size	3.00
Mean basket size	4.41
Maximum basket size	32.00

item_freq <- itemFrequency(groceries, type = "relative")
top_items <- sort(item_freq, decreasing = TRUE)[1:10]

top_items_table <- data.frame(
  Item = names(top_items),
  Support = round(as.numeric(top_items), 4),
  Percent_of_Transactions = round(as.numeric(top_items) * 100, 2)
)

kable(top_items_table, caption = "Top 10 Most Common Grocery Items")

Top 10 Most Common Grocery Items
Item	Support	Percent_of_Transactions
whole milk	0.2555	25.55
other vegetables	0.1935	19.35
rolls/buns	0.1839	18.39
soda	0.1744	17.44
yogurt	0.1395	13.95
bottled water	0.1105	11.05
root vegetables	0.1090	10.90
tropical fruit	0.1049	10.49
shopping bags	0.0985	9.85
sausage	0.0940	9.40

Association Rule Mining

For this analysis, I used a minimum support of 0.001 and a minimum confidence of 0.20. Since grocery baskets can contain many different item combinations, a low support threshold helps capture less common but still meaningful rules. I set a minimum rule length of 2 so that every rule has at least one item on the left-hand side, and I limited the rules to a maximum length of 3 items (counting both sides of the rule) to keep the final rules easier to interpret.
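
To make the support threshold more concrete, the short sketch below converts it into a number of transactions. It uses only base R and the transaction count reported earlier in this report (9835). The variable names are illustrative and are not used elsewhere in the analysis.

```r
# Values taken from this report: 9835 transactions, minimum support of 0.001.
n_transactions <- 9835
min_support <- 0.001

# Fractional number of transactions implied by the threshold.
implied_count <- min_support * n_transactions

# Smallest whole number of transactions whose support reaches the threshold.
min_whole_count <- ceiling(implied_count)

cat("Implied count:", implied_count, "\n")
cat("Smallest whole count meeting the threshold:", min_whole_count, "\n")
cat("Support at that count:", round(min_whole_count / n_transactions, 5), "\n")
```

A support of 0.001 therefore corresponds to an item combination appearing in about 10 of the 9835 baskets, which is why the rules found at this threshold describe fairly rare combinations.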

rules <- apriori(
  groceries,
  parameter = list(
    supp = 0.001,
    conf = 0.20,
    minlen = 2,
    maxlen = 3,
    target = "rules"
  )
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3

##  done [0.00s].
## writing ... [9957 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

cat("Number of rules generated:", length(rules), "\n")

## Number of rules generated: 9957

Top 10 Rules by Lift

top10_rules <- sort(rules, by = "lift", decreasing = TRUE)[1:10]

q <- quality(top10_rules)

# Some versions of arules include count automatically. This makes the code work either way.
if (!"count" %in% names(q)) {
  q$count <- round(q$support * length(groceries))
}

top10_table <- data.frame(
  Rule = labels(top10_rules),
  Support = round(q$support, 4),
  Confidence = round(q$confidence, 4),
  Lift = round(q$lift, 3),
  Count = q$count
)

kable(top10_table, caption = "Top 10 Association Rules Sorted by Lift")

Top 10 Association Rules Sorted by Lift
Rule	Support	Confidence	Lift	Count
{bottled beer,red/blush wine} => {liquor}	0.0019	0.3958	35.716	19
{hamburger meat,soda} => {Instant food products}	0.0012	0.2105	26.209	12
{ham,white bread} => {processed cheese}	0.0019	0.3800	22.928	19
{bottled beer,liquor} => {red/blush wine}	0.0019	0.4130	21.494	19
{Instant food products,soda} => {hamburger meat}	0.0012	0.6316	18.996	12
{curd,sugar} => {flour}	0.0011	0.3235	18.608	11
{baking powder,sugar} => {flour}	0.0010	0.3125	17.973	10
{processed cheese,white bread} => {ham}	0.0019	0.4634	17.803	19
{fruit/vegetable juice,ham} => {processed cheese}	0.0011	0.2895	17.466	11
{margarine,sugar} => {flour}	0.0016	0.2963	17.041	16

Interpretation of Association Rules

The rules with the highest lift show item combinations that occur together much more often than would be expected if the items were purchased independently. For example, a lift of 20 means the right-hand side item is about 20 times more likely to appear in a transaction that contains the left-hand side item(s) than in a transaction chosen at random from the whole dataset.
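
Because lift is confidence divided by the baseline rate of the right-hand side, that baseline can be recovered from any row of the table. The sketch below does this for the first rule, {bottled beer,red/blush wine} => {liquor}, using the rounded values printed in the table. The variable names are illustrative, and the result is approximate because the inputs are rounded.

```r
# Values copied from the first row of the top-10 table in this report.
confidence <- 0.3958  # share of {bottled beer, red/blush wine} baskets that contain liquor
lift <- 35.716

# lift = confidence / support(RHS), so support(RHS) = confidence / lift.
rhs_base_rate <- confidence / lift

cat("Implied share of all transactions containing liquor:",
    round(rhs_base_rate, 4), "\n")
```

This works out to roughly 1.1% of all baskets containing liquor, compared with about 39.6% of the baskets that contain both bottled beer and red/blush wine.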

The rules with the highest lift usually have low support because very specific combinations do not happen in a large percentage of all transactions. In the table above, each of the top 10 rules is backed by only 10 to 19 transactions. This does not mean the rules are useless; it means they should be interpreted as interesting purchasing patterns within a small set of baskets rather than broad trends across every customer.

From a business perspective, these rules could help with product placement, promotions, bundling, or understanding which items customers tend to buy together.

Extra Credit: Simple Cluster Analysis

For the extra credit section, I created a simple clustering analysis using the top 20 most common items. Each transaction was converted into a binary format where 1 means the item was purchased and 0 means it was not purchased. Then I used k-means clustering to group similar baskets together.
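
As a small illustration of the binary format described above, the sketch below encodes three made-up baskets (not taken from the Groceries data) as a 0/1 matrix in base R. The object names are hypothetical; the actual analysis uses the arules coercion shown in the code that follows.

```r
# Three made-up baskets, for illustration only.
toy_baskets <- list(
  c("whole milk", "yogurt"),
  c("soda"),
  c("whole milk", "soda", "rolls/buns")
)

# Columns of the binary matrix: every item seen in the toy baskets.
toy_items <- sort(unique(unlist(toy_baskets)))

# One row per basket: 1 if the item is in the basket, 0 otherwise.
toy_matrix <- t(vapply(
  toy_baskets,
  function(b) as.numeric(toy_items %in% b),
  numeric(length(toy_items))
))
colnames(toy_matrix) <- toy_items

toy_matrix

# Squared Euclidean distance between baskets 1 and 3, which is what k-means works with.
sum((toy_matrix[1, ] - toy_matrix[3, ])^2)
```

For 0/1 data, the squared Euclidean distance between two baskets equals the number of items on which they differ, so k-means tends to group baskets that share the same items.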

set.seed(123)

# Use only the top 20 most frequent items to keep the clustering simple and interpretable.
top20_items <- names(sort(itemFrequency(groceries), decreasing = TRUE))[1:20]

# Convert transactions to a binary matrix for only those top 20 items.
binary_matrix <- as(groceries[, top20_items], "matrix") * 1

# K-means clustering with 4 clusters.
km <- kmeans(binary_matrix, centers = 4, nstart = 25)

cluster_sizes <- table(km$cluster)
cluster_summary <- data.frame(
  Cluster = as.numeric(names(cluster_sizes)),
  Transactions = as.numeric(cluster_sizes),
  Percent = round(as.numeric(cluster_sizes) / length(groceries) * 100, 2)
)

kable(cluster_summary, caption = "Cluster Sizes")

Cluster Sizes
Cluster	Transactions	Percent
1	1789	18.19
2	1500	15.25
3	5017	51.01
4	1529	15.55

# Show the most common items within each cluster.
cluster_profiles <- do.call(rbind, lapply(sort(unique(km$cluster)), function(cluster_id) {
  cluster_data <- binary_matrix[km$cluster == cluster_id, , drop = FALSE]
  rates <- sort(colMeans(cluster_data), decreasing = TRUE)[1:8]
  data.frame(
    Cluster = cluster_id,
    Item = names(rates),
    Purchase_Rate = round(as.numeric(rates), 3)
  )
}))

kable(cluster_profiles, caption = "Most Common Items Within Each Cluster")

Most Common Items Within Each Cluster
Cluster	Item	Purchase_Rate
1	other vegetables	0.999
1	whole milk	0.411
1	root vegetables	0.262
1	yogurt	0.233
1	rolls/buns	0.215
1	tropical fruit	0.191
1	citrus fruit	0.157
1	whipped/sour cream	0.156
2	soda	1.000
2	rolls/buns	0.209
2	whole milk	0.165
2	bottled water	0.165
2	shopping bags	0.141
2	sausage	0.129
2	yogurt	0.127
2	pastry	0.109
3	rolls/buns	0.154
3	canned beer	0.105
3	yogurt	0.096
3	bottled water	0.087
3	shopping bags	0.086
3	bottled beer	0.080
3	tropical fruit	0.068
3	sausage	0.067
4	whole milk	1.000
4	rolls/buns	0.221
4	yogurt	0.183
4	root vegetables	0.146
4	tropical fruit	0.135
4	bottled water	0.117
4	pastry	0.112
4	newspapers	0.109

Cluster Interpretation

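The cluster analysis groups transactions by which of the top 20 items they contain. Three of the four clusters are each defined almost entirely by a single staple item: nearly every basket in Cluster 1 contains other vegetables (purchase rate 0.999), every basket in Cluster 2 contains soda, and every basket in Cluster 4 contains whole milk. Cluster 3 is the largest group (51.01% of transactions) and has no dominant item; its most common item, rolls/buns, appears in only 15.4% of its baskets. In practical terms, Cluster 1 looks like produce-oriented baskets (root vegetables, tropical fruit, and citrus fruit are also common), Cluster 2 like soda-centered baskets, Cluster 4 like whole-milk baskets, and Cluster 3 like a catch-all group of baskets that rarely contain any of those three anchor items.

This clustering is simple, but it still adds value because it shows that baskets differ in systematic ways. Some are built around fresh produce, some around soda, and some around whole milk, while about half of all baskets fit none of these patterns. A grocery store could use this type of information to inform store layout, targeted coupons, and product recommendations.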

Conclusion

This Market Basket Analysis found grocery items that are commonly purchased together using association rules. Support, confidence, and lift helped evaluate the strength of each rule. The top 10 rules by lift highlight item combinations that happen together more often than expected by chance. The extra credit clustering analysis also grouped transactions into similar basket patterns, which could be useful for customer segmentation and marketing decisions.