This assignment uses the Groceries dataset to perform Market Basket Analysis. Each row in the data represents one customer transaction, and each column contains an item that appeared on that receipt. The goal is to find association rules that show which grocery items are often purchased together.
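For the association rule mining, I report the three main measures: support, confidence, and lift. Support is the share of all transactions that contain every item in the rule. Confidence is the share of transactions containing the left-hand side that also contain the right-hand side. Lift is the confidence divided by the overall support of the right-hand side; a lift value greater than 1 means the items appear together more often than expected if they were purchased independently.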
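As a minimal illustration of support, confidence, and lift, the sketch below computes all three by hand in base R for a single rule. The five baskets and the rule {bread} => {butter} are made up for this example and are not taken from the Groceries data.

```r
# Hypothetical toy baskets (not from the Groceries data).
toy_baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk", "soda"),
  c("butter", "soda")
)

# TRUE when a basket contains every item in `items`.
contains_all <- function(basket, items) all(items %in% basket)

lhs <- "bread"
rhs <- "butter"

n          <- length(toy_baskets)
count_lhs  <- sum(sapply(toy_baskets, contains_all, items = lhs))
count_rhs  <- sum(sapply(toy_baskets, contains_all, items = rhs))
count_both <- sum(sapply(toy_baskets, contains_all, items = c(lhs, rhs)))

support    <- count_both / n                # share of all baskets with bread and butter
confidence <- count_both / count_lhs        # share of bread baskets that also have butter
lift       <- confidence / (count_rhs / n)  # confidence relative to butter's overall share

cat("support:", support, " confidence:", round(confidence, 3), " lift:", round(lift, 3), "\n")
```

In this toy data the rule has a support of 0.4, a confidence of about 0.667, and a lift of about 1.11, so butter is only slightly more common in bread baskets than in baskets overall.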
# Run these install lines only if the packages are not already installed:
# install.packages("arules")
# install.packages("knitr")
library(arules)
library(knitr)
# Make sure GroceryDataSet.csv is saved in the same folder as this Rmd file.
raw_data <- read.csv(
"GroceryDataSet.csv",
header = FALSE,
stringsAsFactors = FALSE,
na.strings = c("", "NA")
)
# Convert each row into a basket/list of items.
# trimws() removes extra spaces, and unique() avoids duplicate items within the same transaction.
basket_list <- lapply(seq_len(nrow(raw_data)), function(i) {
items <- unlist(raw_data[i, ], use.names = FALSE)  # flatten the one-row data frame into a plain vector
items <- trimws(items)
items <- items[!is.na(items) & items != ""]
unique(items)
})
# Convert the basket list into transactions for the arules package.
groceries <- as(basket_list, "transactions")
# Basic dataset information.
num_transactions <- length(groceries)
num_items <- length(itemLabels(groceries))
cat("Number of transactions:", num_transactions, "\n")
## Number of transactions: 9835
cat("Number of unique items:", num_items, "\n")
## Number of unique items: 169
basket_sizes <- size(groceries)
summary_table <- data.frame(
Measure = c("Minimum basket size", "Median basket size", "Mean basket size", "Maximum basket size"),
Value = c(
min(basket_sizes),
median(basket_sizes),
round(mean(basket_sizes), 2),
max(basket_sizes)
)
)
kable(summary_table, caption = "Summary of Basket Sizes")
| Measure | Value |
|---|---|
| Minimum basket size | 1.00 |
| Median basket size | 3.00 |
| Mean basket size | 4.41 |
| Maximum basket size | 32.00 |
item_freq <- itemFrequency(groceries, type = "relative")
top_items <- sort(item_freq, decreasing = TRUE)[1:10]
top_items_table <- data.frame(
Item = names(top_items),
Support = round(as.numeric(top_items), 4),
Percent_of_Transactions = round(as.numeric(top_items) * 100, 2)
)
kable(top_items_table, caption = "Top 10 Most Common Grocery Items")
| Item | Support | Percent_of_Transactions |
|---|---|---|
| whole milk | 0.2555 | 25.55 |
| other vegetables | 0.1935 | 19.35 |
| rolls/buns | 0.1839 | 18.39 |
| soda | 0.1744 | 17.44 |
| yogurt | 0.1395 | 13.95 |
| bottled water | 0.1105 | 11.05 |
| root vegetables | 0.1090 | 10.90 |
| tropical fruit | 0.1049 | 10.49 |
| shopping bags | 0.0985 | 9.85 |
| sausage | 0.0940 | 9.40 |
For this analysis, I used a minimum support of 0.001 and a minimum confidence of 0.20. With 9,835 transactions, a support of 0.001 corresponds to an itemset appearing in roughly 10 baskets. Because grocery baskets can contain many different item combinations, a low support threshold helps capture less common but still meaningful rules. I also limited each rule to at most 3 items in total (left-hand and right-hand sides combined) so the final rules are easier to interpret.
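To make the support threshold concrete, the short calculation below converts the relative threshold into a number of transactions. It uses only the threshold and the transaction total reported for this dataset; the variable names are illustrative.

```r
n_transactions <- 9835   # number of transactions reported for this dataset
min_support    <- 0.001  # relative support threshold used in this analysis

# Smallest whole number of transactions whose share is at least min_support.
min_count <- ceiling(min_support * n_transactions)

cat("Transactions needed:", min_count, "\n")
cat("Share at that count:", round(min_count / n_transactions, 5), "\n")
```

An itemset therefore has to appear in at least 10 baskets for its relative support to reach 0.001.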
rules <- apriori(
groceries,
parameter = list(
supp = 0.001,
conf = 0.20,
minlen = 2,
maxlen = 3,
target = "rules"
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 3 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3
## done [0.00s].
## writing ... [9957 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
cat("Number of rules generated:", length(rules), "\n")
## Number of rules generated: 9957
top10_rules <- sort(rules, by = "lift", decreasing = TRUE)[1:10]
q <- quality(top10_rules)
# Some versions of arules include count automatically. This makes the code work either way.
if (!"count" %in% names(q)) {
q$count <- round(q$support * length(groceries))
}
top10_table <- data.frame(
Rule = labels(top10_rules),
Support = round(q$support, 4),
Confidence = round(q$confidence, 4),
Lift = round(q$lift, 3),
Count = q$count
)
kable(top10_table, caption = "Top 10 Association Rules Sorted by Lift")
| Rule | Support | Confidence | Lift | Count |
|---|---|---|---|---|
| {bottled beer,red/blush wine} => {liquor} | 0.0019 | 0.3958 | 35.716 | 19 |
| {hamburger meat,soda} => {Instant food products} | 0.0012 | 0.2105 | 26.209 | 12 |
| {ham,white bread} => {processed cheese} | 0.0019 | 0.3800 | 22.928 | 19 |
| {bottled beer,liquor} => {red/blush wine} | 0.0019 | 0.4130 | 21.494 | 19 |
| {Instant food products,soda} => {hamburger meat} | 0.0012 | 0.6316 | 18.996 | 12 |
| {curd,sugar} => {flour} | 0.0011 | 0.3235 | 18.608 | 11 |
| {baking powder,sugar} => {flour} | 0.0010 | 0.3125 | 17.973 | 10 |
| {processed cheese,white bread} => {ham} | 0.0019 | 0.4634 | 17.803 | 19 |
| {fruit/vegetable juice,ham} => {processed cheese} | 0.0011 | 0.2895 | 17.466 | 11 |
| {margarine,sugar} => {flour} | 0.0016 | 0.2963 | 17.041 | 16 |
The rules with the highest lift show item combinations that occur together much more often than expected by chance. For example, a lift of 20 means the right-hand-side item is about 20 times more likely to appear in transactions that contain the left-hand-side items than it is across all transactions.
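Because lift is confidence divided by the overall support of the right-hand side, the baseline rate of the right-hand item can be recovered from values already in the table. The sketch below does this for the first rule, {bottled beer,red/blush wine} => {liquor}. The inputs are the rounded table values, so the result is approximate.

```r
# Values copied from the top rule in the table (rounded as reported).
confidence <- 0.3958
lift       <- 35.716

# lift = confidence / support(RHS), so support(RHS) = confidence / lift.
rhs_support <- confidence / lift

cat("Implied share of all baskets containing liquor:", round(rhs_support, 4), "\n")
cat("Share of {bottled beer, red/blush wine} baskets containing liquor:", confidence, "\n")
```

Liquor appears in roughly 1.1% of all baskets but in about 39.6% of baskets that already contain bottled beer and red/blush wine, which is where the lift of about 35.7 comes from.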
The rules with the highest lift usually have low support because very specific combinations occur in only a small percentage of all transactions. In the table above, each of the top 10 rules is based on only 10 to 19 of the 9,835 transactions. This does not mean the rules are useless, but lift estimates built on so few baskets can be unstable, so they are best read as interesting purchasing patterns rather than broad trends across every customer.
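One cautious way to handle low-support rules is to require a minimum transaction count before treating a high-lift rule as actionable. The sketch below applies such a filter to the lift and count values copied from the top-10 table. The cutoff of 15 transactions is an arbitrary assumption for illustration, not a recommended standard.

```r
# Lift and count values copied from the top-10 table.
top10 <- data.frame(
  Rule = c(
    "{bottled beer,red/blush wine} => {liquor}",
    "{hamburger meat,soda} => {Instant food products}",
    "{ham,white bread} => {processed cheese}",
    "{bottled beer,liquor} => {red/blush wine}",
    "{Instant food products,soda} => {hamburger meat}",
    "{curd,sugar} => {flour}",
    "{baking powder,sugar} => {flour}",
    "{processed cheese,white bread} => {ham}",
    "{fruit/vegetable juice,ham} => {processed cheese}",
    "{margarine,sugar} => {flour}"
  ),
  Lift  = c(35.716, 26.209, 22.928, 21.494, 18.996, 18.608, 17.973, 17.803, 17.466, 17.041),
  Count = c(19, 12, 19, 19, 12, 11, 10, 19, 11, 16)
)

# Assumed cutoff for illustration: keep rules seen in at least 15 transactions.
min_count <- 15
print(top10[top10$Count >= min_count, ])
```

With this cutoff, five of the ten rules remain: the two beer, wine, and liquor rules, the two ham, white bread, and processed cheese rules, and {margarine,sugar} => {flour}.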
From a business perspective, these rules could help with product placement, promotions, bundling, or understanding which items customers tend to buy together.
For the extra credit section, I created a simple clustering analysis using the top 20 most common items. Each transaction was converted into a binary format where 1 means the item was purchased and 0 means it was not purchased. Then I used k-means clustering to group similar baskets together.
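The binary encoding and clustering step can be illustrated on a tiny made-up example before it is applied to real transactions. The baskets, item names, and the choice of two clusters below are hypothetical and were picked only to keep the sketch small.

```r
set.seed(123)

# Hypothetical toy baskets (not from the Groceries data).
toy_baskets <- list(
  c("milk", "yogurt"),
  c("milk", "yogurt", "bread"),
  c("milk", "yogurt"),
  c("soda", "chips"),
  c("soda", "chips", "bread"),
  c("soda", "chips")
)

items <- sort(unique(unlist(toy_baskets)))

# One row per basket, one column per item: 1 = purchased, 0 = not purchased.
toy_matrix <- t(sapply(toy_baskets, function(basket) as.integer(items %in% basket)))
colnames(toy_matrix) <- items

# Group the baskets into two clusters; several random starts reduce the
# chance of keeping a poor local solution.
toy_km <- kmeans(toy_matrix, centers = 2, nstart = 25)

print(toy_matrix)
print(toy_km$cluster)
```

Each basket becomes a row of 0s and 1s, and k-means groups rows that share most of their 1s, so the milk and yogurt baskets should land in one cluster and the soda and chips baskets in the other. Cluster numbers themselves are arbitrary labels.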
set.seed(123)
# Use only the top 20 most frequent items to keep the clustering simple and interpretable.
top20_items <- names(sort(itemFrequency(groceries), decreasing = TRUE))[1:20]
# Convert transactions to a binary matrix for only those top 20 items.
binary_matrix <- as(groceries[, top20_items], "matrix") * 1
# K-means clustering with 4 clusters.
km <- kmeans(binary_matrix, centers = 4, nstart = 25)
cluster_sizes <- table(km$cluster)
cluster_summary <- data.frame(
Cluster = as.numeric(names(cluster_sizes)),
Transactions = as.numeric(cluster_sizes),
Percent = round(as.numeric(cluster_sizes) / length(groceries) * 100, 2)
)
kable(cluster_summary, caption = "Cluster Sizes")
| Cluster | Transactions | Percent |
|---|---|---|
| 1 | 1789 | 18.19 |
| 2 | 1500 | 15.25 |
| 3 | 5017 | 51.01 |
| 4 | 1529 | 15.55 |
# Show the most common items within each cluster.
cluster_profiles <- do.call(rbind, lapply(sort(unique(km$cluster)), function(cluster_id) {
cluster_data <- binary_matrix[km$cluster == cluster_id, , drop = FALSE]
rates <- sort(colMeans(cluster_data), decreasing = TRUE)[1:8]
data.frame(
Cluster = cluster_id,
Item = names(rates),
Purchase_Rate = round(as.numeric(rates), 3)
)
}))
kable(cluster_profiles, caption = "Most Common Items Within Each Cluster")
| Cluster | Item | Purchase_Rate |
|---|---|---|
| 1 | other vegetables | 0.999 |
| 1 | whole milk | 0.411 |
| 1 | root vegetables | 0.262 |
| 1 | yogurt | 0.233 |
| 1 | rolls/buns | 0.215 |
| 1 | tropical fruit | 0.191 |
| 1 | citrus fruit | 0.157 |
| 1 | whipped/sour cream | 0.156 |
| 2 | soda | 1.000 |
| 2 | rolls/buns | 0.209 |
| 2 | whole milk | 0.165 |
| 2 | bottled water | 0.165 |
| 2 | shopping bags | 0.141 |
| 2 | sausage | 0.129 |
| 2 | yogurt | 0.127 |
| 2 | pastry | 0.109 |
| 3 | rolls/buns | 0.154 |
| 3 | canned beer | 0.105 |
| 3 | yogurt | 0.096 |
| 3 | bottled water | 0.087 |
| 3 | shopping bags | 0.086 |
| 3 | bottled beer | 0.080 |
| 3 | tropical fruit | 0.068 |
| 3 | sausage | 0.067 |
| 4 | whole milk | 1.000 |
| 4 | rolls/buns | 0.221 |
| 4 | yogurt | 0.183 |
| 4 | root vegetables | 0.146 |
| 4 | tropical fruit | 0.135 |
| 4 | bottled water | 0.117 |
| 4 | pastry | 0.112 |
| 4 | newspapers | 0.109 |
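The cluster analysis gives a simple way to group transactions based on similar purchasing behavior. Three of the four clusters are each defined almost entirely by a single staple: nearly every basket in cluster 1 contains other vegetables (purchase rate 0.999), every basket in cluster 2 contains soda, and every basket in cluster 4 contains whole milk. Cluster 3, the largest at about 51% of transactions, has no dominant item; its most common item, rolls/buns, appears in only about 15% of its baskets. These groups can help a grocery store distinguish basket types such as produce-centered, soda-centered, and milk-centered baskets.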
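One way to read the cluster profile table is to look at the single most common item in each cluster and ask how dominant it is. The sketch below does this with the top purchase rate per cluster copied from the table. The 0.9 cutoff for calling a cluster "anchored" on one item is an arbitrary assumption for illustration.

```r
# Top item and purchase rate per cluster, copied from the profile table.
top_item_per_cluster <- data.frame(
  Cluster       = 1:4,
  Item          = c("other vegetables", "soda", "rolls/buns", "whole milk"),
  Purchase_Rate = c(0.999, 1.000, 0.154, 1.000)
)

# Assumed cutoff: a cluster is "anchored" if one item is in at least 90% of its baskets.
anchor_cutoff <- 0.9
top_item_per_cluster$Anchored <- top_item_per_cluster$Purchase_Rate >= anchor_cutoff

print(top_item_per_cluster)
```

Clusters 1, 2, and 4 are flagged as anchored on other vegetables, soda, and whole milk respectively, while cluster 3 has no dominant item.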
This clustering is simple, but it still adds value because it shows that baskets do not all look the same. Some are built around a single staple such as whole milk, other vegetables, or soda, while the largest group is more mixed. A grocery store could use this type of information to inform store layout, targeted coupons, and product recommendations.
This Market Basket Analysis found grocery items that are commonly purchased together using association rules. Support, confidence, and lift helped evaluate the strength of each rule. The top 10 rules by lift highlight item combinations that happen together more often than expected by chance. The extra credit clustering analysis also grouped transactions into similar basket patterns, which could be useful for customer segmentation and marketing decisions.