The goal of this assignment is to use grocery receipt data to perform a market basket analysis. Each row in the data represents one customer transaction, and each non-empty cell in that row is an item purchased in that basket. I used the Apriori algorithm to identify association rules and report the main rule-quality metrics: support, confidence, and lift.
For the extra credit portion, I also ran a simple clustering analysis using the most frequently purchased items. The clustering section gives a broad view of common basket types, while the association-rule section gives more detailed item-to-item relationships.
To connect this assignment more closely to predictive analytics, I also added a train/test validation step. The idea is to mine rules on a training set and then check whether those rules still appear to perform reasonably well on unseen test transactions.
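The code in this report assumes several packages are already loaded. A minimal setup chunk covering the functions used below would look like this:

# packages assumed throughout this report
library(arules)    # transactions, apriori(), rule-quality measures
library(dplyr)     # pipelines: mutate(), group_by(), summarise()
library(stringr)   # str_trim(), str_wrap()
library(tidyr)     # pivot_longer()
library(purrr)     # map_dbl(), map_dfr()
library(tibble)    # tibble()
library(ggplot2)   # plots
library(knitr)     # kable()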
data_url <- "https://raw.githubusercontent.com/JaydeeJan/Data-624-HW-10/refs/heads/main/GroceryDataSet.csv"
grocery_raw <- read.csv(
  data_url,
  header = FALSE,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA"),
  strip.white = TRUE
)
# clean the text values by trimming extra spaces
grocery_clean <- grocery_raw %>%
  mutate(across(everything(), ~str_trim(as.character(.x))))
# convert each row into a list of items
transaction_list <- lapply(seq_len(nrow(grocery_clean)), function(i) {
  items <- unlist(grocery_clean[i, ], use.names = FALSE)
  items <- items[!is.na(items) & items != ""]
  unique(items)
})
# convert the list of baskets into an arules transactions object
grocery_transactions <- as(transaction_list, "transactions")
# basic structure of the transaction data
summary(grocery_transactions)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The data was converted from a wide receipt format into a transaction object. This makes it possible to use the Apriori algorithm directly. I also removed blank cells and trimmed extra spaces so that items with the same name would not be counted as different products.
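As a quick sanity check on the conversion, the first few baskets can be printed with arules' inspect():

# peek at the first three baskets to confirm the conversion worked
inspect(grocery_transactions[1:3])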
# basic transaction-level summary
transaction_summary <- tibble(
  total_transactions = length(grocery_transactions),
  unique_items = length(itemLabels(grocery_transactions)),
  average_basket_size = mean(size(grocery_transactions)),
  median_basket_size = median(size(grocery_transactions)),
  minimum_basket_size = min(size(grocery_transactions)),
  maximum_basket_size = max(size(grocery_transactions))
)
kable(
  transaction_summary,
  caption = "Basic Summary of Grocery Transactions",
  digits = 3
)
| total_transactions | unique_items | average_basket_size | median_basket_size | minimum_basket_size | maximum_basket_size |
|---|---|---|---|---|---|
| 9835 | 169 | 4.409 | 3 | 1 | 32 |
# plot shows the top items by relative frequency
itemFrequencyPlot(
  grocery_transactions,
  topN = 20,
  type = "relative",
  main = "Top 20 Grocery Items by Relative Frequency",
  ylab = "Relative Frequency"
)
From the frequency summary, the most common items are broad grocery staples such as whole milk, other vegetables, rolls/buns, soda, and yogurt. This is useful context because common items may appear in many rules, but a common item alone does not necessarily mean the rule is interesting. That is why lift is important in the association-rule analysis.
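For reference, the exact values behind the plot can be pulled straight from the transactions object with itemFrequency():

# relative frequency of the five most common items
head(sort(itemFrequency(grocery_transactions), decreasing = TRUE), 5)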
For this assignment, I used the Apriori algorithm. The three rule metrics are:

- Support: the proportion of all transactions that contain every item in the rule.
- Confidence: the proportion of transactions containing the left-hand-side items that also contain the right-hand-side items.
- Lift: the ratio of the rule's confidence to the baseline frequency of the right-hand-side itemset; values above 1 indicate a positive association.

I used a minimum support of 0.001, which means a rule must appear in about 10 transactions because the dataset has 9,835 transactions. I used a minimum confidence of 0.20, meaning the right-hand-side itemset must appear at least 20% of the time when the left-hand-side itemset appears. I also limited the maximum rule length to 4 items to keep the rules interpretable.
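To make the three metrics concrete, here is a small worked example. The whole milk count comes from the summary output above; the butter counts are hypothetical placeholders, not values from the data:

n_total <- 9835   # total transactions
n_lhs <- 500      # hypothetical count of baskets containing {butter}
n_both <- 150     # hypothetical count of baskets containing {butter, whole milk}
n_rhs <- 2513     # baskets containing whole milk (from the summary above)
support <- n_both / n_total               # ~0.0153
confidence <- n_both / n_lhs              # 0.30
lift <- confidence / (n_rhs / n_total)    # ~1.17, a mild positive association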
min_support <- 0.001
min_confidence <- 0.20
rules <- apriori(
  grocery_transactions,
  parameter = list(
    supp = min_support,
    conf = min_confidence,
    minlen = 2,
    maxlen = 4,
    target = "rules"
  ),
  control = list(verbose = FALSE)
)
# remove redundant rules so the final rule list is cleaner
rules <- rules[!is.redundant(rules)]
# create a cleaner summary table instead of printing the full raw rule summary
rules_summary_table <- tibble(
  number_of_rules = length(rules),
  average_support = mean(quality(rules)$support),
  median_support = median(quality(rules)$support),
  average_confidence = mean(quality(rules)$confidence),
  median_confidence = median(quality(rules)$confidence),
  average_lift = mean(quality(rules)$lift),
  median_lift = median(quality(rules)$lift),
  max_lift = max(quality(rules)$lift)
)
kable(
  rules_summary_table,
  caption = "Summary of Association Rule Quality Measures",
  digits = 3
)
| number_of_rules | average_support | median_support | average_confidence | median_confidence | average_lift | median_lift | max_lift |
|---|---|---|---|---|---|---|---|
| 17261 | 0.002 | 0.001 | 0.391 | 0.347 | 3.073 | 2.81 | 35.716 |
The Apriori model generated 17,261 non-redundant rules. The average lift is about 3.07, which suggests that many rules show positive association between itemsets. However, the average support is low, so many of these rules represent specific item combinations rather than very common basket patterns.
# sort by lift and keep the top 10 rules
top_10_rules <- sort(rules, by = "lift", decreasing = TRUE)[1:10]
# create a clean table for reporting
top_10_table <- tibble(
  rule = labels(top_10_rules),
  support = round(quality(top_10_rules)$support, 5),
  confidence = round(quality(top_10_rules)$confidence, 3),
  lift = round(quality(top_10_rules)$lift, 2),
  count = quality(top_10_rules)$count
)
kable(top_10_table, caption = "Top 10 Association Rules by Lift")
| rule | support | confidence | lift | count |
|---|---|---|---|---|
| {bottled beer,red/blush wine} => {liquor} | 0.00193 | 0.396 | 35.72 | 19 |
| {hamburger meat,soda} => {Instant food products} | 0.00122 | 0.211 | 26.21 | 12 |
| {ham,white bread} => {processed cheese} | 0.00193 | 0.380 | 22.93 | 19 |
| {bottled beer,liquor} => {red/blush wine} | 0.00193 | 0.413 | 21.49 | 19 |
| {Instant food products,soda} => {hamburger meat} | 0.00122 | 0.632 | 19.00 | 12 |
| {curd,sugar} => {flour} | 0.00112 | 0.324 | 18.61 | 11 |
| {baking powder,sugar} => {flour} | 0.00102 | 0.312 | 17.97 | 10 |
| {processed cheese,white bread} => {ham} | 0.00193 | 0.463 | 17.80 | 19 |
| {fruit/vegetable juice,ham} => {processed cheese} | 0.00112 | 0.289 | 17.47 | 11 |
| {margarine,sugar} => {flour} | 0.00163 | 0.296 | 17.04 | 16 |
The top rules by lift show combinations that occur together much more often than random chance would suggest. In my run, many of the strongest lift rules are not the highest-support rules. This means they are interesting but also relatively rare. For example, rules involving combinations such as beer/wine/liquor, hamburger meat/soda/instant food products, processed cheese/ham/white bread, and baking-related items show very strong associations, but they occur in a small number of baskets.
Because of this, I would not interpret the top-lift rules as the most important products overall. Instead, I would interpret them as specific niche purchasing patterns. For store decision-making, these rules could be useful for small promotions, aisle placement, or bundle suggestions, but I would still check the support count before making large inventory decisions.
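That caution can be expressed directly in code by filtering the rule set on several quality measures at once. The sketch below uses arules' subset(); the cutoffs are illustrative, not tuned:

# keep rules with strong lift and enough supporting baskets to act on
actionable_rules <- subset(rules, subset = lift > 3 & count >= 50)
inspect(head(sort(actionable_rules, by = "lift"), 5))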
To make the analysis closer to a predictive analytics workflow, I also used a train/test split. The association rules were mined only from the training transactions, and then the top 10 training rules were evaluated on the test transactions. This is similar to validating a predictive model on unseen data. The goal is to check whether the strongest rules from the training data still show reasonable support, confidence, and lift in new baskets.
set.seed(624)
# split the transactions into training and testing sets
n_transactions <- length(grocery_transactions)
train_index <- sample(
  seq_len(n_transactions),
  size = round(0.80 * n_transactions)
)
train_transactions <- grocery_transactions[train_index]
test_transactions <- grocery_transactions[-train_index]
# mine rules only on the training data
train_rules <- apriori(
  train_transactions,
  parameter = list(
    supp = min_support,
    conf = min_confidence,
    minlen = 2,
    maxlen = 4,
    target = "rules"
  ),
  control = list(verbose = FALSE)
)
train_rules <- train_rules[!is.redundant(train_rules)]
# keep the top 10 training rules by lift
top_10_train_rules <- sort(train_rules, by = "lift", decreasing = TRUE)[1:10]
# this function evaluates one rule on a new transaction set
evaluate_rule <- function(rule, transactions) {
  transaction_matrix <- as(transactions, "matrix")
  lhs_items <- LIST(lhs(rule), decode = TRUE)[[1]]
  rhs_items <- LIST(rhs(rule), decode = TRUE)[[1]]
  all_items <- c(lhs_items, rhs_items)
  if (!all(all_items %in% colnames(transaction_matrix))) {
    return(
      tibble(
        rule = labels(rule),
        test_support = 0,
        test_confidence = NA,
        test_lift = NA,
        test_count = 0
      )
    )
  }
  lhs_present <- rowSums(transaction_matrix[, lhs_items, drop = FALSE]) == length(lhs_items)
  rhs_present <- rowSums(transaction_matrix[, rhs_items, drop = FALSE]) == length(rhs_items)
  both_present <- lhs_present & rhs_present
  test_support <- mean(both_present)
  test_confidence <- ifelse(
    sum(lhs_present) == 0,
    NA,
    sum(both_present) / sum(lhs_present)
  )
  rhs_support <- mean(rhs_present)
  test_lift <- ifelse(
    rhs_support == 0,
    NA,
    test_confidence / rhs_support
  )
  test_count <- sum(both_present)
  tibble(
    rule = labels(rule),
    test_support = round(test_support, 5),
    test_confidence = round(test_confidence, 3),
    test_lift = round(test_lift, 2),
    test_count = test_count
  )
}
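For comparison, arules can score an existing rule set against new transactions directly. If I have the interface right, interestMeasure() would replace most of the hand-rolled function above:

# alternative sketch: built-in computation of the same three measures
test_quality <- interestMeasure(
  top_10_train_rules,
  measure = c("support", "confidence", "lift"),
  transactions = test_transactions
)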
# training metrics for the top training rules
train_validation_table <- tibble(
  rule = labels(top_10_train_rules),
  train_support = round(quality(top_10_train_rules)$support, 5),
  train_confidence = round(quality(top_10_train_rules)$confidence, 3),
  train_lift = round(quality(top_10_train_rules)$lift, 2),
  train_count = quality(top_10_train_rules)$count
)
# test metrics for the same rules
test_validation_table <- map_dfr(
  seq_len(length(top_10_train_rules)),
  function(i) evaluate_rule(top_10_train_rules[i], test_transactions)
)
validation_table <- train_validation_table %>%
  left_join(test_validation_table, by = "rule")
kable(
  validation_table,
  caption = "Validation of Top Training Rules on Test Transactions"
)
| rule | train_support | train_confidence | train_lift | train_count | test_support | test_confidence | test_lift | test_count |
|---|---|---|---|---|---|---|---|---|
| {bottled beer,red/blush wine} => {liquor} | 0.00216 | 0.395 | 34.56 | 17 | 0.00102 | 0.400 | 41.41 | 2 |
| {ham,white bread,whole milk} => {processed cheese} | 0.00102 | 0.421 | 26.29 | 8 | 0.00051 | 0.200 | 10.63 | 1 |
| {curd,sugar,whole milk} => {flour} | 0.00114 | 0.429 | 24.98 | 9 | 0.00000 | 0.000 | 0.00 | 0 |
| {bottled beer,liquor} => {red/blush wine} | 0.00216 | 0.436 | 21.98 | 17 | 0.00102 | 0.286 | 17.03 | 2 |
| {ham,white bread} => {processed cheese} | 0.00165 | 0.351 | 21.94 | 13 | 0.00305 | 0.462 | 24.54 | 6 |
| {oil,pastry} => {detergent} | 0.00102 | 0.364 | 19.46 | 8 | 0.00000 | 0.000 | 0.00 | 0 |
| {baking powder,sugar} => {flour} | 0.00114 | 0.333 | 19.43 | 9 | 0.00051 | 0.200 | 10.93 | 1 |
| {curd,sugar} => {flour} | 0.00114 | 0.321 | 18.73 | 9 | 0.00102 | 0.333 | 18.21 | 2 |
| {baking powder,margarine} => {flour} | 0.00102 | 0.320 | 18.65 | 8 | 0.00051 | 0.200 | 10.93 | 1 |
| {margarine,whipped/sour cream,yogurt} => {flour} | 0.00102 | 0.308 | 17.93 | 8 | 0.00000 | 0.000 | 0.00 | 0 |
The validation results show that the top training rules do not all generalize equally to the test set. Some rules still have high lift in the test transactions, but their test counts are very small. For example, the alcohol-related rules still have strong lift, but they only appear in a few test baskets. Several other rules have a test count of 0, which means those specific combinations did not appear in the test sample.
This is useful from a predictive analytics perspective because it shows why rule validation matters. A rule can look strong in the training data because of high lift, but if it has very low support or does not appear in the test set, I would be careful about using it for business decisions. I would trust a rule more if it has high lift, reasonable confidence, and a test count that is not too small.
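A simple way to act on that standard is to filter the validation table before drawing conclusions; the minimum test count of 5 below is an arbitrary illustration:

# keep only rules observed at least 5 times in the test baskets
reliable_rules <- validation_table %>%
  filter(test_count >= 5)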
rules_plot_data <- tibble(
  support = quality(rules)$support,
  confidence = quality(rules)$confidence,
  lift = quality(rules)$lift
)
ggplot(rules_plot_data, aes(x = support, y = confidence, color = lift)) +
  geom_point(alpha = 0.65) +
  labs(
    title = "Association Rules: Support, Confidence, and Lift",
    x = "Support",
    y = "Confidence",
    color = "Lift"
  ) +
  theme_minimal()
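As an aside, the arulesViz package can draw an equivalent scatter plot in one call; a hedged alternative, assuming arulesViz is installed:

# same support/confidence scatter, shaded by lift, via arulesViz
library(arulesViz)
plot(rules, measure = c("support", "confidence"), shading = "lift")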
top_10_table %>%
  mutate(
    rule_wrapped = str_wrap(rule, width = 45),
    rule_wrapped = reorder(rule_wrapped, lift)
  ) %>%
  ggplot(aes(x = rule_wrapped, y = lift)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Association Rules by Lift",
    x = "Association Rule",
    y = "Lift"
  ) +
  theme_minimal()
The scatter plot shows the relationship between support, confidence, and lift for the association rules. Most rules have low support, which means they appear in a small proportion of total baskets. This is expected in grocery transaction data because many item combinations are specific and do not occur very often.
The bar chart highlights the top 10 rules ranked by lift. These rules have strong associations, but the earlier table shows that many of them also have low counts. Because of that, I would use these rules as possible promotional or product-placement ideas, but I would not rely on lift alone.
For the cluster analysis, I used the 30 most frequently purchased items and converted each transaction into a binary row. A value of 1 means the item appeared in the transaction, and a value of 0 means it did not. I used k-means clustering as a simple way to group similar baskets.
# convert the transaction data into a binary item matrix
item_matrix <- as(grocery_transactions, "matrix")
# use the top 30 items to keep the clustering simple and interpretable
top_30_items <- itemFrequency(grocery_transactions, type = "absolute") %>%
  sort(decreasing = TRUE) %>%
  head(30) %>%
  names()
cluster_data <- as.data.frame(
  as.matrix(item_matrix[, top_30_items]),
  check.names = FALSE
) %>%
  mutate(across(everything(), as.numeric))
cluster_data[1:5, 1:8]
# use an elbow plot to compare different values of k
set.seed(624)
wss <- map_dbl(1:8, function(k) {
  kmeans(cluster_data, centers = k, nstart = 25)$tot.withinss
})
wss_table <- tibble(
  k = 1:8,
  total_within_cluster_ss = wss
)
kable(
  wss_table,
  caption = "Within-Cluster Sum of Squares by Number of Clusters",
  digits = 2
)
| k | total_within_cluster_ss |
|---|---|
| 1 | 23978.83 |
| 2 | 21917.85 |
| 3 | 20770.19 |
| 4 | 19815.81 |
| 5 | 19136.69 |
| 6 | 18640.54 |
| 7 | 18332.90 |
| 8 | 17952.73 |
ggplot(wss_table, aes(x = k, y = total_within_cluster_ss)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Elbow Plot for K-Means Clustering",
    x = "Number of Clusters (k)",
    y = "Total Within-Cluster Sum of Squares"
  ) +
  theme_minimal()
I selected k = 4 as a simple and interpretable clustering solution. The goal here is not to build a perfect segmentation model, but to create a basic summary of different basket patterns in the grocery data.
# final k-means model using k = 4
set.seed(624)
final_k <- 4
kmeans_model <- kmeans(
  cluster_data,
  centers = final_k,
  nstart = 50
)
cluster_assignments <- tibble(
  transaction_id = seq_len(nrow(cluster_data)),
  cluster = factor(kmeans_model$cluster),
  basket_size = size(grocery_transactions)
)
cluster_sizes <- cluster_assignments %>%
  count(cluster) %>%
  mutate(percent = round(n / sum(n) * 100, 1))
kable(
  cluster_sizes,
  caption = "Cluster Sizes"
)
| cluster | n | percent |
|---|---|---|
| 1 | 1806 | 18.4 |
| 2 | 5013 | 51.0 |
| 3 | 1467 | 14.9 |
| 4 | 1549 | 15.7 |
# summarize basket sizes within each cluster
cluster_basket_summary <- cluster_assignments %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    average_basket_size = round(mean(basket_size), 2),
    median_basket_size = median(basket_size),
    minimum_basket_size = min(basket_size),
    maximum_basket_size = max(basket_size),
    .groups = "drop"
  )
kable(
  cluster_basket_summary,
  caption = "Basket Size Summary by Cluster"
)
| cluster | n | average_basket_size | median_basket_size | minimum_basket_size | maximum_basket_size |
|---|---|---|---|---|---|
| 1 | 1806 | 7.51 | 7 | 1 | 32 |
| 2 | 5013 | 2.79 | 2 | 1 | 21 |
| 3 | 1467 | 5.03 | 4 | 1 | 21 |
| 4 | 1549 | 5.45 | 5 | 1 | 29 |
# profile each cluster by showing the most common items inside that cluster
cluster_profiles <- cluster_data %>%
  mutate(cluster = factor(kmeans_model$cluster)) %>%
  pivot_longer(
    cols = -cluster,
    names_to = "item",
    values_to = "item_present"
  ) %>%
  group_by(cluster, item) %>%
  summarise(item_rate = mean(item_present), .groups = "drop") %>%
  group_by(cluster) %>%
  arrange(desc(item_rate), .by_group = TRUE) %>%
  slice_head(n = 8) %>%
  mutate(item_rate = round(item_rate, 3))
kable(
  cluster_profiles,
  caption = "Top Items Within Each Cluster"
)
| cluster | item | item_rate |
|---|---|---|
| 1 | other vegetables | 0.997 |
| 1 | whole milk | 0.408 |
| 1 | root vegetables | 0.261 |
| 1 | yogurt | 0.234 |
| 1 | rolls/buns | 0.215 |
| 1 | tropical fruit | 0.190 |
| 1 | citrus fruit | 0.158 |
| 1 | whipped/sour cream | 0.158 |
| 2 | rolls/buns | 0.154 |
| 2 | canned beer | 0.105 |
| 2 | yogurt | 0.096 |
| 2 | bottled water | 0.087 |
| 2 | shopping bags | 0.086 |
| 2 | bottled beer | 0.080 |
| 2 | tropical fruit | 0.068 |
| 2 | newspapers | 0.067 |
| 3 | soda | 1.000 |
| 3 | rolls/buns | 0.206 |
| 3 | bottled water | 0.166 |
| 3 | whole milk | 0.155 |
| 3 | shopping bags | 0.142 |
| 3 | sausage | 0.127 |
| 3 | yogurt | 0.121 |
| 3 | pastry | 0.110 |
| 4 | whole milk | 1.000 |
| 4 | rolls/buns | 0.224 |
| 4 | yogurt | 0.189 |
| 4 | root vegetables | 0.150 |
| 4 | tropical fruit | 0.139 |
| 4 | bottled water | 0.117 |
| 4 | pastry | 0.112 |
| 4 | newspapers | 0.107 |
# PCA is used only for visualization because the original cluster data has 30 dimensions.
pca_result <- prcomp(cluster_data, center = TRUE, scale. = FALSE)
pca_plot_data <- tibble(
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2],
  cluster = factor(kmeans_model$cluster)
)
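Because only the first two components are plotted, it is worth checking how much of the variation they actually capture; a quick diagnostic on the PCA object:

# proportion of variance explained by the first two principal components
round(summary(pca_result)$importance["Proportion of Variance", 1:2], 3)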
ggplot(pca_plot_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.35, size = 0.8) +
  labs(
    title = "K-Means Grocery Basket Clusters Shown with PCA",
    x = "Principal Component 1",
    y = "Principal Component 2",
    color = "Cluster"
  ) +
  theme_minimal()
The cluster profiles give a simple view of common shopping patterns. Cluster 1 contains 18.4% of the transactions and has the largest average basket size at about 7.51 items. This cluster is strongly associated with other vegetables, along with whole milk, root vegetables, yogurt, and rolls/buns. Cluster 2 is the largest group with 51.0% of the transactions, but it has the smallest average basket size at about 2.79 items, so I would interpret it as a low-item mixed-basket group. Cluster 3 is mainly associated with soda baskets, while Cluster 4 is mainly associated with whole milk baskets.
I would treat this clustering result as a broad descriptive summary rather than a final customer segmentation model. One limitation is that k-means is applied here to binary purchase indicators, even though it is better suited to continuous numeric variables. For this assignment it is still useful as a simple extra-credit clustering method, but a binary-aware distance such as Jaccard could be more appropriate for basket data.
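To sketch what that alternative might look like, arules provides a dissimilarity() method that supports Jaccard distance between transactions. The example below clusters a random sample of baskets to keep the distance matrix manageable; the sample size and linkage method are illustrative choices, not tuned ones:

# hedged sketch: hierarchical clustering on Jaccard distances between baskets
set.seed(624)
sample_index <- sample(length(grocery_transactions), 1000)
jaccard_distances <- dissimilarity(
  grocery_transactions[sample_index],
  method = "jaccard"
)
jaccard_hclust <- hclust(jaccard_distances, method = "average")
jaccard_clusters <- cutree(jaccard_hclust, k = 4)
table(jaccard_clusters)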
The market basket analysis found that the grocery data contains both broad staple-item patterns and more specific high-lift combinations. Whole milk, other vegetables, rolls/buns, soda, and yogurt are among the most frequent items overall. The top association rules by lift show stronger niche relationships, such as alcohol-related combinations, convenience-food combinations, sandwich-related combinations, and baking-related combinations.
The most important caution is that high lift does not always mean high business impact. Some of the highest-lift rules have low support, meaning they happen in only a small share of total baskets. A practical recommendation would be to use lift together with support and confidence. Rules with high lift and reasonable support would be the best candidates for store layout changes, promotions, or recommendation systems.
The train/test validation step makes the analysis closer to a predictive analytics workflow because the rules are not only discovered in one dataset but checked on unseen transactions. This helps identify whether a rule may generalize or whether it may only be strong because of rare combinations in the training data.
The extra-credit clustering analysis supports the same general conclusion: grocery baskets can be grouped into broad shopping patterns based on common staple items, but the most interesting detailed relationships are better captured by association rules.