Overview

This assignment uses the Groceries dataset to perform Market Basket Analysis. Each row in the data represents one customer transaction, and each column contains an item that appeared on that receipt. The goal is to find association rules that show which grocery items are often purchased together.

For the association rule mining, I report the three main measures:

  • Support: the proportion of all transactions that contain every item in the rule.
  • Confidence: the proportion of transactions containing the left-hand side item(s) that also contain the right-hand side item(s).
  • Lift: how much more likely the right-hand side item(s) are to be purchased when the left-hand side item(s) are purchased, compared with how often the right-hand side item(s) are purchased across all transactions.

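A lift value greater than 1 means the items appear together more often than would be expected if they were purchased independently. A value of 1 indicates no association, and a value below 1 means they appear together less often than expected.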
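The three measures can also be written as formulas. The notation here is my own shorthand rather than something taken from the assignment: X is the left-hand side, Y is the right-hand side, and N is the total number of transactions.

```latex
\begin{align*}
\mathrm{supp}(X \Rightarrow Y) &= \frac{\text{number of transactions containing every item in } X \cup Y}{N} \\
\mathrm{conf}(X \Rightarrow Y) &= \frac{\mathrm{supp}(X \Rightarrow Y)}{\mathrm{supp}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}
\end{align*}
```

Here supp(X) and supp(Y) are the proportions of all transactions that contain X and Y on their own.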

Load Packages

# Run these install lines only if the packages are not already installed:
# install.packages("arules")
# install.packages("knitr")

library(arules)
library(knitr)

Load and Clean the Data

# Make sure GroceryDataSet.csv is saved in the same folder as this Rmd file.
raw_data <- read.csv(
  "GroceryDataSet.csv",
  header = FALSE,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA")
)

# Convert each row into a basket/list of items.
# trimws() removes extra spaces, and unique() avoids duplicate items within the same transaction.
basket_list <- lapply(seq_len(nrow(raw_data)), function(i) {
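  # unlist() flattens the row to a plain vector so that missing cells stay NA.
  items <- as.character(unlist(raw_data[i, ], use.names = FALSE))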
  items <- trimws(items)
  items <- items[!is.na(items) & items != ""]
  unique(items)
})

# Convert the basket list into transactions for the arules package.
groceries <- as(basket_list, "transactions")

# Basic dataset information.
num_transactions <- length(groceries)
num_items <- length(itemLabels(groceries))

cat("Number of transactions:", num_transactions, "\n")
## Number of transactions: 9835
cat("Number of unique items:", num_items, "\n")
## Number of unique items: 169

Basic Basket Summary

basket_sizes <- size(groceries)

summary_table <- data.frame(
  Measure = c("Minimum basket size", "Median basket size", "Mean basket size", "Maximum basket size"),
  Value = c(
    min(basket_sizes),
    median(basket_sizes),
    round(mean(basket_sizes), 2),
    max(basket_sizes)
  )
)

kable(summary_table, caption = "Summary of Basket Sizes")

Summary of Basket Sizes

| Measure             | Value |
|:--------------------|------:|
| Minimum basket size |  1.00 |
| Median basket size  |  3.00 |
| Mean basket size    |  4.41 |
| Maximum basket size | 32.00 |

item_freq <- itemFrequency(groceries, type = "relative")
top_items <- sort(item_freq, decreasing = TRUE)[1:10]

top_items_table <- data.frame(
  Item = names(top_items),
  Support = round(as.numeric(top_items), 4),
  Percent_of_Transactions = round(as.numeric(top_items) * 100, 2)
)

kable(top_items_table, caption = "Top 10 Most Common Grocery Items")

Top 10 Most Common Grocery Items

| Item             | Support | Percent_of_Transactions |
|:-----------------|--------:|------------------------:|
| whole milk       |  0.2555 |                   25.55 |
| other vegetables |  0.1935 |                   19.35 |
| rolls/buns       |  0.1839 |                   18.39 |
| soda             |  0.1744 |                   17.44 |
| yogurt           |  0.1395 |                   13.95 |
| bottled water    |  0.1105 |                   11.05 |
| root vegetables  |  0.1090 |                   10.90 |
| tropical fruit   |  0.1049 |                   10.49 |
| shopping bags    |  0.0985 |                    9.85 |
| sausage          |  0.0940 |                    9.40 |

Association Rule Mining

For this analysis, I used a minimum support of 0.001 and a minimum confidence of 0.20. Since grocery baskets can contain many different item combinations, a low support threshold helps capture less common but still meaningful rules. I also limited the rules to a maximum length of 3 items to keep the final rules easier to interpret.
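To make the 0.001 threshold more concrete, the short calculation below converts it into a number of transactions, using the 9,835 transactions counted earlier. The variable names are only for illustration.

```r
# Illustrative arithmetic: convert the relative support threshold to a count.
n_transactions <- 9835   # number of transactions reported above
min_support    <- 0.001  # minimum support used in this analysis

# Number of transactions that corresponds to exactly the threshold.
threshold_count <- min_support * n_transactions
print(threshold_count)

# Smallest whole number of transactions whose support reaches the threshold.
min_whole_count <- ceiling(threshold_count)
print(min_whole_count)
```

The product is 9.835, so an item combination has to appear in at least 10 transactions for its support to reach 0.001. The "Absolute minimum support count: 9" line in the apriori log appears to be the same product cut down to a whole number.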

rules <- apriori(
  groceries,
  parameter = list(
    supp = 0.001,
    conf = 0.20,
    minlen = 2,
    maxlen = 3,
    target = "rules"
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3
##  done [0.00s].
## writing ... [9957 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
cat("Number of rules generated:", length(rules), "\n")
## Number of rules generated: 9957

Top 10 Rules by Lift

top10_rules <- sort(rules, by = "lift", decreasing = TRUE)[1:10]

q <- quality(top10_rules)

# Some versions of arules include count automatically. This makes the code work either way.
if (!"count" %in% names(q)) {
  q$count <- round(q$support * length(groceries))
}

top10_table <- data.frame(
  Rule = labels(top10_rules),
  Support = round(q$support, 4),
  Confidence = round(q$confidence, 4),
  Lift = round(q$lift, 3),
  Count = q$count
)

kable(top10_table, caption = "Top 10 Association Rules Sorted by Lift")

Top 10 Association Rules Sorted by Lift

| Rule                                             | Support | Confidence |   Lift | Count |
|:-------------------------------------------------|--------:|-----------:|-------:|------:|
| {bottled beer,red/blush wine} => {liquor}        |  0.0019 |     0.3958 | 35.716 |    19 |
| {hamburger meat,soda} => {Instant food products} |  0.0012 |     0.2105 | 26.209 |    12 |
| {ham,white bread} => {processed cheese}          |  0.0019 |     0.3800 | 22.928 |    19 |
| {bottled beer,liquor} => {red/blush wine}        |  0.0019 |     0.4130 | 21.494 |    19 |
| {Instant food products,soda} => {hamburger meat} |  0.0012 |     0.6316 | 18.996 |    12 |
| {curd,sugar} => {flour}                          |  0.0011 |     0.3235 | 18.608 |    11 |
| {baking powder,sugar} => {flour}                 |  0.0010 |     0.3125 | 17.973 |    10 |
| {processed cheese,white bread} => {ham}          |  0.0019 |     0.4634 | 17.803 |    19 |
| {fruit/vegetable juice,ham} => {processed cheese} |  0.0011 |     0.2895 | 17.466 |    11 |
| {margarine,sugar} => {flour}                     |  0.0016 |     0.2963 | 17.041 |    16 |

Interpretation of Association Rules

The rules with the highest lift show item combinations that occur together much more often than expected by chance. For example, a lift of 20 means the right-hand side item is about 20 times more likely to appear in transactions that contain the left-hand side item(s) than it is across all transactions.
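As a worked example of reading lift, the calculation below takes the confidence and lift of the top rule, {bottled beer,red/blush wine} => {liquor}, and backs out how often liquor is purchased overall. It relies on the standard relationship lift = confidence / support of the right-hand side; the variable names are illustrative.

```r
# Values copied from the first row of the top 10 rules table.
rule_confidence <- 0.3958   # share of beer-and-wine baskets that also contain liquor
rule_lift       <- 35.716

# lift = confidence / support(RHS), so the overall rate of the RHS is:
baseline_rhs_support <- rule_confidence / rule_lift
print(round(baseline_rhs_support, 4))
```

Liquor appears in roughly 1.1% of all transactions, but in about 39.6% of the transactions that contain both bottled beer and red/blush wine. That gap is what the lift of about 35.7 summarizes.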

The rules with the highest lift usually have low support because very specific combinations do not happen in a large percentage of all transactions. This does not mean the rules are useless; it means they should be interpreted as interesting purchasing patterns rather than broad trends across every customer.
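One way to look past the very rare combinations is to keep only rules that clear a higher support level as well as a lift level. The sketch below is not part of the assignment: it assumes the Groceries data bundled with arules can stand in for GroceryDataSet.csv so that it runs on its own, and the cutoffs of 0.005 and 2 are arbitrary illustrative choices.

```r
library(arules)

# Assumption: the Groceries data shipped with arules is used here so the
# sketch is self-contained; it is a stand-in for GroceryDataSet.csv.
data("Groceries")

rules_demo <- apriori(
  Groceries,
  parameter = list(supp = 0.001, conf = 0.20, minlen = 2, maxlen = 3),
  control = list(verbose = FALSE)
)

# Illustrative cutoffs (not from the assignment): keep rules that are
# both more common and clearly above chance.
broader_rules <- subset(rules_demo, subset = support >= 0.005 & lift > 2)

cat("All rules:", length(rules_demo), "\n")
cat("Rules with support >= 0.005 and lift > 2:", length(broader_rules), "\n")

# Show the five remaining rules with the highest lift.
inspect(head(sort(broader_rules, by = "lift"), 5))
```

Rules that survive this filter have lower lift than the top 10 above, but they describe patterns that affect many more baskets.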

From a business perspective, these rules could help with product placement, promotions, bundling, or understanding which items customers tend to buy together.

Extra Credit: Simple Cluster Analysis

For the extra credit section, I created a simple clustering analysis using the top 20 most common items. Each transaction was converted into a binary format where 1 means the item was purchased and 0 means it was not purchased. Then I used k-means clustering to group similar baskets together.
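Before running this on the real data, the idea can be shown on a handful of made-up baskets. The sketch below uses only base R; the baskets and object names are invented for illustration and are not taken from the dataset.

```r
# Toy baskets (made up for illustration).
toy_baskets <- list(
  c("whole milk", "yogurt"),
  c("soda"),
  c("whole milk"),
  c("soda", "pastry"),
  c("whole milk", "pastry")
)

toy_items <- sort(unique(unlist(toy_baskets)))

# One row per basket, one column per item; 1 = purchased, 0 = not purchased.
toy_matrix <- t(sapply(toy_baskets, function(b) as.numeric(toy_items %in% b)))
colnames(toy_matrix) <- toy_items
print(toy_matrix)

set.seed(123)
toy_km <- kmeans(toy_matrix, centers = 2, nstart = 10)

# On 0/1 data, each cluster center is the share of baskets in that cluster
# that contain each item.
print(round(toy_km$centers, 2))
```

Because each center is a mean of 0/1 columns, it can be read directly as a purchase rate per item, which is how the cluster profiles later in this section are interpreted.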

set.seed(123)

# Use only the top 20 most frequent items to keep the clustering simple and interpretable.
top20_items <- names(sort(itemFrequency(groceries), decreasing = TRUE))[1:20]

# Convert transactions to a binary matrix for only those top 20 items.
binary_matrix <- as(groceries[, top20_items], "matrix") * 1

# K-means clustering with 4 clusters.
km <- kmeans(binary_matrix, centers = 4, nstart = 25)

cluster_sizes <- table(km$cluster)
cluster_summary <- data.frame(
  Cluster = as.numeric(names(cluster_sizes)),
  Transactions = as.numeric(cluster_sizes),
  Percent = round(as.numeric(cluster_sizes) / length(groceries) * 100, 2)
)

kable(cluster_summary, caption = "Cluster Sizes")

Cluster Sizes

| Cluster | Transactions | Percent |
|--------:|-------------:|--------:|
|       1 |         1789 |   18.19 |
|       2 |         1500 |   15.25 |
|       3 |         5017 |   51.01 |
|       4 |         1529 |   15.55 |

# Show the most common items within each cluster.
cluster_profiles <- do.call(rbind, lapply(sort(unique(km$cluster)), function(cluster_id) {
  cluster_data <- binary_matrix[km$cluster == cluster_id, , drop = FALSE]
  rates <- sort(colMeans(cluster_data), decreasing = TRUE)[1:8]
  data.frame(
    Cluster = cluster_id,
    Item = names(rates),
    Purchase_Rate = round(as.numeric(rates), 3)
  )
}))

kable(cluster_profiles, caption = "Most Common Items Within Each Cluster")

Most Common Items Within Each Cluster

| Cluster | Item               | Purchase_Rate |
|--------:|:-------------------|--------------:|
|       1 | other vegetables   |         0.999 |
|       1 | whole milk         |         0.411 |
|       1 | root vegetables    |         0.262 |
|       1 | yogurt             |         0.233 |
|       1 | rolls/buns         |         0.215 |
|       1 | tropical fruit     |         0.191 |
|       1 | citrus fruit       |         0.157 |
|       1 | whipped/sour cream |         0.156 |
|       2 | soda               |         1.000 |
|       2 | rolls/buns         |         0.209 |
|       2 | whole milk         |         0.165 |
|       2 | bottled water      |         0.165 |
|       2 | shopping bags      |         0.141 |
|       2 | sausage            |         0.129 |
|       2 | yogurt             |         0.127 |
|       2 | pastry             |         0.109 |
|       3 | rolls/buns         |         0.154 |
|       3 | canned beer        |         0.105 |
|       3 | yogurt             |         0.096 |
|       3 | bottled water      |         0.087 |
|       3 | shopping bags      |         0.086 |
|       3 | bottled beer       |         0.080 |
|       3 | tropical fruit     |         0.068 |
|       3 | sausage            |         0.067 |
|       4 | whole milk         |         1.000 |
|       4 | rolls/buns         |         0.221 |
|       4 | yogurt             |         0.183 |
|       4 | root vegetables    |         0.146 |
|       4 | tropical fruit     |         0.135 |
|       4 | bottled water      |         0.117 |
|       4 | pastry             |         0.112 |
|       4 | newspapers         |         0.109 |

Cluster Interpretation

The cluster analysis gives a simple way to group transactions based on similar purchasing behavior. In the results above, three of the four clusters are each defined almost entirely by a single staple item: cluster 1 by other vegetables (purchase rate 0.999), cluster 2 by soda (1.000), and cluster 4 by whole milk (1.000). Cluster 3, the largest at about 51% of transactions, collects the remaining baskets. None of those three staples is among its eight most common items, and even its most common item, rolls/buns, has a purchase rate of only 0.154. These groups can help a grocery store distinguish basket types such as produce-oriented baskets, soda-centered baskets, and milk-centered baskets.

This clustering is simple, but it still adds value because it shows that baskets do not all look the same. Some are built around a core household staple such as whole milk or other vegetables, others center on soda, and about half are mixed baskets with no single dominant item. A grocery store could use this type of information to improve store layout, targeted coupons, and product recommendations.

Conclusion

This Market Basket Analysis found grocery items that are commonly purchased together using association rules. Support, confidence, and lift helped evaluate the strength of each rule. The top 10 rules by lift highlight item combinations that happen together more often than expected by chance. The extra credit clustering analysis also grouped transactions into similar basket patterns, which could be useful for customer segmentation and marketing decisions.