1 Introduction

The goal of this assignment is to use grocery receipt data to perform a market basket analysis. Each row in the data represents one customer transaction, and each column in that row represents an item purchased in that same basket. I used the Apriori algorithm to identify association rules and report the main rule-quality metrics: support, confidence, and lift.

For the extra credit portion, I also ran a simple clustering analysis using the most frequently purchased items. The clustering section gives a broad view of common basket types, while the association-rule section gives more detailed item-to-item relationships.

To connect this assignment more closely to predictive analytics, I also added a train/test validation step. The idea is to mine rules on a training set and then check whether those rules still appear to perform reasonably well on unseen test transactions.

2 Load and Prepare the Data

data_url <- "https://raw.githubusercontent.com/JaydeeJan/Data-624-HW-10/refs/heads/main/GroceryDataSet.csv"

grocery_raw <- read.csv(
  data_url,
  header = FALSE,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA"),
  strip.white = TRUE
)

# clean the text values by trimming extra spaces
grocery_clean <- grocery_raw %>%
  mutate(across(everything(), ~str_trim(as.character(.x))))

# convert each row into a list of items
transaction_list <- lapply(seq_len(nrow(grocery_clean)), function(i) {
  items <- unlist(grocery_clean[i, ], use.names = FALSE)
  items <- items[!is.na(items) & items != ""]
  unique(items)
})

# convert the list of baskets into an arules transactions object
grocery_transactions <- as(transaction_list, "transactions")

# basic structure of the transaction data
summary(grocery_transactions)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The data was converted from a wide receipt format into a transaction object. This makes it possible to use the Apriori algorithm directly. I also removed blank cells and trimmed extra spaces so that items with the same name would not be counted as different products.

3 Exploratory Summary

# basic transaction-level summary
transaction_summary <- tibble(
  total_transactions = length(grocery_transactions),
  unique_items = length(itemLabels(grocery_transactions)),
  average_basket_size = mean(size(grocery_transactions)),
  median_basket_size = median(size(grocery_transactions)),
  minimum_basket_size = min(size(grocery_transactions)),
  maximum_basket_size = max(size(grocery_transactions))
)

kable(
  transaction_summary,
  caption = "Basic Summary of Grocery Transactions",
  digits = 3
)
Basic Summary of Grocery Transactions
total_transactions unique_items average_basket_size median_basket_size minimum_basket_size maximum_basket_size
9835 169 4.409 3 1 32
# plot shows the top items by relative frequency
itemFrequencyPlot(
  grocery_transactions,
  topN = 20,
  type = "relative",
  main = "Top 20 Grocery Items by Relative Frequency",
  ylab = "Relative Frequency"
)

From the frequency summary, the most common items are broad grocery staples such as whole milk, other vegetables, rolls/buns, soda, and yogurt. This is useful context because common items may appear in many rules, but a common item alone does not necessarily mean the rule is interesting. That is why lift is important in the association-rule analysis.

4 Association Rule Mining

For this assignment, I used the Apriori algorithm. The three rule metrics are:

  • Support: the proportion of transactions that contain both the left-hand-side itemset and the right-hand-side itemset.
  • Confidence: the proportion of transactions with the left-hand-side itemset that also contain the right-hand-side itemset.
  • Lift: how much more often the left-hand-side and right-hand-side appear together compared with what would be expected if they were independent. A lift greater than 1 means the items are positively associated.

I used a minimum support of 0.001, which means a rule must appear in about 10 transactions because the dataset has 9,835 transactions. I used a minimum confidence of 0.20, meaning the right-hand-side itemset must appear at least 20% of the time when the left-hand-side itemset appears. I also limited the maximum rule length to 4 items to keep the rules interpretable.

min_support <- 0.001
min_confidence <- 0.20

rules <- apriori (
  grocery_transactions,
  parameter = list(
    supp = min_support,
    conf = min_confidence,
    minlen = 2,
    maxlen = 4,
    target = "rules"
  ),
  control = list(verbose = FALSE)
)

# remove redundant rules so the final rule list is cleaner
rules <- rules[!is.redundant(rules)]

# create a cleaner summary table instead of printing the full raw rule summary
rules_summary_table <- tibble(
  number_of_rules = length(rules),
  average_support = mean(quality(rules)$support),
  median_support = median(quality(rules)$support),
  average_confidence = mean(quality(rules)$confidence),
  median_confidence = median(quality(rules)$confidence),
  average_lift = mean(quality(rules)$lift),
  median_lift = median(quality(rules)$lift),
  max_lift = max(quality(rules)$lift)
)

kable(
  rules_summary_table,
  caption = "Summary of Association Rule Quality Measures",
  digits = 3
)
Summary of Association Rule Quality Measures
number_of_rules average_support median_support average_confidence median_confidence average_lift median_lift max_lift
17261 0.002 0.001 0.391 0.347 3.073 2.81 35.716

The Apriori model generated 17,261 non-redundant rules. The average lift is about 3.07, which suggests that many rules show positive association between itemsets. However, the average support is low, so many of these rules represent specific item combinations rather than very common basket patterns.

# sort by lift and keep the top 10 rules
top_10_rules <- sort(rules, by = "lift", decreasing = TRUE)[1:10]

# create a clean table for reporting
top_10_table <- tibble(
  rule = labels(top_10_rules),
  support = round(quality(top_10_rules)$support, 5),
  confidence = round(quality(top_10_rules)$confidence, 3),
  lift = round(quality(top_10_rules)$lift, 2),
  count = quality(top_10_rules)$count
)

kable(top_10_table, caption = "Top 10 Association Rules by Lift")
Top 10 Association Rules by Lift
rule support confidence lift count
{bottled beer,red/blush wine} => {liquor} 0.00193 0.396 35.72 19
{hamburger meat,soda} => {Instant food products} 0.00122 0.211 26.21 12
{ham,white bread} => {processed cheese} 0.00193 0.380 22.93 19
{bottled beer,liquor} => {red/blush wine} 0.00193 0.413 21.49 19
{Instant food products,soda} => {hamburger meat} 0.00122 0.632 19.00 12
{curd,sugar} => {flour} 0.00112 0.324 18.61 11
{baking powder,sugar} => {flour} 0.00102 0.312 17.97 10
{processed cheese,white bread} => {ham} 0.00193 0.463 17.80 19
{fruit/vegetable juice,ham} => {processed cheese} 0.00112 0.289 17.47 11
{margarine,sugar} => {flour} 0.00163 0.296 17.04 16

The top rules by lift show combinations that occur together much more often than random chance would suggest. In my run, many of the strongest lift rules are not the highest-support rules. This means they are interesting but also relatively rare. For example, rules involving combinations such as beer/wine/liquor, hamburger meat/soda/instant food products, processed cheese/ham/white bread, and baking-related items show very strong associations, but they occur in a small number of baskets.

Because of this, I would not interpret the top-lift rules as the most important products overall. Instead, I would interpret them as specific niche purchasing patterns. For store decision-making, these rules could be useful for small promotions, aisle placement, or bundle suggestions, but I would still check the support count before making large inventory decisions.

4.1 Rule Validation Using a Train/Test Split

To make the analysis closer to a predictive analytics workflow, I also used a train/test split. The association rules were mined only from the training transactions, and then the top 10 training rules were evaluated on the test transactions. This is similar to validating a predictive model on unseen data. The goal is to check whether the strongest rules from the training data still show reasonable support, confidence, and lift in new baskets.

set.seed(624)

# split the transactions into training and testing sets
n_transactions <- length(grocery_transactions)

train_index <- sample(
  seq_len(n_transactions),
  size = round(0.80 * n_transactions)
)

train_transactions <- grocery_transactions[train_index]
test_transactions <- grocery_transactions[-train_index]

# mine rules only on the training data
train_rules <- apriori(
  train_transactions,
  parameter = list(
    supp = min_support,
    conf = min_confidence,
    minlen = 2,
    maxlen = 4,
    target = "rules"
  ),
  control = list(verbose = FALSE)
)

train_rules <- train_rules[!is.redundant(train_rules)]

# keep the top 10 training rules by lift
top_10_train_rules <- sort(train_rules, by = "lift", decreasing = TRUE)[1:10]

# this function evaluates one rule on a new transaction set
evaluate_rule <- function(rule, transactions) {
  
  transaction_matrix <- as(transactions, "matrix")
  
  lhs_items <- LIST(lhs(rule), decode = TRUE)[[1]]
  rhs_items <- LIST(rhs(rule), decode = TRUE)[[1]]
  
  all_items <- c(lhs_items, rhs_items)
  
  if (!all(all_items %in% colnames(transaction_matrix))) {
    return(
      tibble(
        rule = labels(rule),
        test_support = 0,
        test_confidence = NA,
        test_lift = NA,
        test_count = 0
      )
    )
  }
  
  lhs_present <- rowSums(transaction_matrix[, lhs_items, drop = FALSE]) == length(lhs_items)
  rhs_present <- rowSums(transaction_matrix[, rhs_items, drop = FALSE]) == length(rhs_items)
  both_present <- lhs_present & rhs_present
  
  test_support <- mean(both_present)
  test_confidence <- ifelse(
    sum(lhs_present) == 0,
    NA,
    sum(both_present) / sum(lhs_present)
  )
  
  rhs_support <- mean(rhs_present)
  test_lift <- ifelse(
    rhs_support == 0,
    NA,
    test_confidence / rhs_support
  )
  
  test_count <- sum(both_present)
  
  tibble(
    rule = labels(rule),
    test_support = round(test_support, 5),
    test_confidence = round(test_confidence, 3),
    test_lift = round(test_lift, 2),
    test_count = test_count
  )
}

# training metrics for the top training rules
train_validation_table <- tibble(
  rule = labels(top_10_train_rules),
  train_support = round(quality(top_10_train_rules)$support, 5),
  train_confidence = round(quality(top_10_train_rules)$confidence, 3),
  train_lift = round(quality(top_10_train_rules)$lift, 2),
  train_count = quality(top_10_train_rules)$count
)

# test metrics for the same rules
test_validation_table <- map_dfr(
  seq_len(length(top_10_train_rules)),
  function(i) evaluate_rule(top_10_train_rules[i], test_transactions)
)

validation_table <- train_validation_table %>%
  left_join(test_validation_table, by = "rule")

kable(
  validation_table,
  caption = "Validation of Top Training Rules on Test Transactions"
)
Validation of Top Training Rules on Test Transactions
rule train_support train_confidence train_lift train_count test_support test_confidence test_lift test_count
{bottled beer,red/blush wine} => {liquor} 0.00216 0.395 34.56 17 0.00102 0.400 41.41 2
{ham,white bread,whole milk} => {processed cheese} 0.00102 0.421 26.29 8 0.00051 0.200 10.63 1
{curd,sugar,whole milk} => {flour} 0.00114 0.429 24.98 9 0.00000 0.000 0.00 0
{bottled beer,liquor} => {red/blush wine} 0.00216 0.436 21.98 17 0.00102 0.286 17.03 2
{ham,white bread} => {processed cheese} 0.00165 0.351 21.94 13 0.00305 0.462 24.54 6
{oil,pastry} => {detergent} 0.00102 0.364 19.46 8 0.00000 0.000 0.00 0
{baking powder,sugar} => {flour} 0.00114 0.333 19.43 9 0.00051 0.200 10.93 1
{curd,sugar} => {flour} 0.00114 0.321 18.73 9 0.00102 0.333 18.21 2
{baking powder,margarine} => {flour} 0.00102 0.320 18.65 8 0.00051 0.200 10.93 1
{margarine,whipped/sour cream,yogurt} => {flour} 0.00102 0.308 17.93 8 0.00000 0.000 0.00 0

The validation results show that the top training rules do not all generalize equally to the test set. Some rules still have high lift in the test transactions, but their test counts are very small. For example, the alcohol-related rules still have strong lift, but they only appear in a few test baskets. Several other rules have a test count of 0, which means those specific combinations did not appear in the test sample.

This is useful from a predictive analytics perspective because it shows why rule validation matters. A rule can look strong in the training data because of high lift, but if it has very low support or does not appear in the test set, I would be careful about using it for business decisions. I would trust a rule more if it has high lift, reasonable confidence, and a test count that is not too small.

5 Rule Visualization

rules_plot_data <- tibble(
  support = quality(rules)$support,
  confidence = quality(rules)$confidence,
  lift = quality(rules)$lift
)

ggplot(rules_plot_data, aes(x = support, y = confidence, color = lift)) +
  geom_point(alpha = 0.65) +
  labs(
    title = "Association Rules: Support, Confidence, and Lift",
    x = "Support",
    y = "Confidence",
    color = "Lift"
  ) +
  theme_minimal()

top_10_table %>%
  mutate(
    rule_wrapped = str_wrap(rule, width = 45),
    rule_wrapped = reorder(rule_wrapped, lift)
  ) %>%
  ggplot(aes(x = rule_wrapped, y = lift)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Association Rules by Lift",
    x = "Association Rule",
    y = "Lift"
  ) +
  theme_minimal()

The scatter plot shows the relationship between support, confidence, and lift for the association rules. Most rules have low support, which means they appear in a small proportion of total baskets. This is expected in grocery transaction data because many item combinations are specific and do not occur very often.

The bar chart highlights the top 10 rules ranked by lift. These rules have strong associations, but the earlier table shows that many of them also have low counts. Because of that, I would use these rules as possible promotional or product-placement ideas, but I would not rely on lift alone.

6 Extra Credit: Simple Cluster Analysis

For the cluster analysis, I used the 30 most frequently purchased items and converted each transaction into a binary row. A value of 1 means the item appeared in the transaction, and a value of 0 means it did not. I used k-means clustering as a simple way to group similar baskets.

# convert the transaction data into a binary item matrix
item_matrix <- as(grocery_transactions, "matrix")

# use the top 30 items to keep the clustering simple and interpretable
top_30_items <- itemFrequency(grocery_transactions, type = "absolute") %>%
  sort(decreasing = TRUE) %>%
  head(30) %>%
  names()

cluster_data <- as.data.frame(
  as.matrix(item_matrix[, top_30_items]),
  check.names = FALSE
) %>%
  mutate(across(everything(), as.numeric))

cluster_data[1:5, 1:8]
# use an elbow plot to compare different values of k
set.seed(624)

wss <- map_dbl(1:8, function(k) {
  kmeans(cluster_data, centers = k, nstart = 25)$tot.withinss
})

wss_table <- tibble(
  k = 1:8,
  total_within_cluster_ss = wss
)

kable(
  wss_table,
  caption = "Within-Cluster Sum of Squares by Number of Clusters",
  digits = 2
)
Within-Cluster Sum of Squares by Number of Clusters
k total_within_cluster_ss
1 23978.83
2 21917.85
3 20770.19
4 19815.81
5 19136.69
6 18640.54
7 18332.90
8 17952.73
ggplot(wss_table, aes(x = k, y = total_within_cluster_ss)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Elbow Plot for K-Means Clustering",
    x = "Number of Clusters (k)",
    y = "Total Within-Cluster Sum of Squares"
  ) +
  theme_minimal()

I selected k = 4 as a simple and interpretable clustering solution. The goal here is not to build a perfect segmentation model, but to create a basic summary of different basket patterns in the grocery data.

# final k-means model using k = 4
set.seed(624)

final_k <- 4

kmeans_model <- kmeans(
  cluster_data,
  centers = final_k,
  nstart = 50
)

cluster_assignments <- tibble(
  transaction_id = seq_len(nrow(cluster_data)),
  cluster = factor(kmeans_model$cluster),
  basket_size = size(grocery_transactions)
)

cluster_sizes <- cluster_assignments %>%
  count(cluster) %>%
  mutate(percent = round(n / sum(n) * 100, 1))

kable(
  cluster_sizes,
  caption = "Cluster Sizes"
)
Cluster Sizes
cluster n percent
1 1806 18.4
2 5013 51.0
3 1467 14.9
4 1549 15.7
# profile each cluster by showing the most common items inside that cluster
cluster_basket_summary <- cluster_assignments %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    average_basket_size = round(mean(basket_size), 2),
    median_basket_size = median(basket_size),
    minimum_basket_size = min(basket_size),
    maximum_basket_size = max(basket_size),
    .groups = "drop"
  )

kable(
  cluster_basket_summary,
  caption = "Basket Size Summary by Cluster"
)
Basket Size Summary by Cluster
cluster n average_basket_size median_basket_size minimum_basket_size maximum_basket_size
1 1806 7.51 7 1 32
2 5013 2.79 2 1 21
3 1467 5.03 4 1 21
4 1549 5.45 5 1 29
# profile each cluster by showing the most common items inside that cluster
cluster_profiles <- cluster_data %>%
  mutate(cluster = factor(kmeans_model$cluster)) %>%
  pivot_longer(
    cols = -cluster,
    names_to = "item",
    values_to = "item_present"
  ) %>%
  group_by(cluster, item) %>%
  summarise(item_rate = mean(item_present), .groups = "drop") %>%
  group_by(cluster) %>%
  arrange(desc(item_rate), .by_group = TRUE) %>%
  slice_head(n = 8) %>%
  mutate(item_rate = round(item_rate, 3))

kable(
  cluster_profiles,
  caption = "Top Items Within Each Cluster"
)
Top Items Within Each Cluster
cluster item item_rate
1 other vegetables 0.997
1 whole milk 0.408
1 root vegetables 0.261
1 yogurt 0.234
1 rolls/buns 0.215
1 tropical fruit 0.190
1 citrus fruit 0.158
1 whipped/sour cream 0.158
2 rolls/buns 0.154
2 canned beer 0.105
2 yogurt 0.096
2 bottled water 0.087
2 shopping bags 0.086
2 bottled beer 0.080
2 tropical fruit 0.068
2 newspapers 0.067
3 soda 1.000
3 rolls/buns 0.206
3 bottled water 0.166
3 whole milk 0.155
3 shopping bags 0.142
3 sausage 0.127
3 yogurt 0.121
3 pastry 0.110
4 whole milk 1.000
4 rolls/buns 0.224
4 yogurt 0.189
4 root vegetables 0.150
4 tropical fruit 0.139
4 bottled water 0.117
4 pastry 0.112
4 newspapers 0.107
# PCA is used only for visualization because the original cluster data has 30 dimensions.
pca_result <- prcomp(cluster_data, center = TRUE, scale. = FALSE)

pca_plot_data <- tibble(
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2],
  cluster = factor(kmeans_model$cluster)
)

ggplot(pca_plot_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.35, size = 0.8) +
  labs(
    title = "K-Means Grocery Basket Clusters Shown with PCA",
    x = "Principal Component 1",
    y = "Principal Component 2",
    color = "Cluster"
  ) +
  theme_minimal()

The cluster profiles give a simple view of common shopping patterns. Cluster 1 contains 18.4% of the transactions and has the largest average basket size at about 7.51 items. This cluster is strongly associated with other vegetables, along with whole milk, root vegetables, yogurt, and rolls/buns. Cluster 2 is the largest group with 51.0% of the transactions, but it has the smallest average basket size at about 2.79 items, so I would interpret it as a low-item mixed-basket group. Cluster 3 is mainly associated with soda baskets, while Cluster 4 is mainly associated with whole milk baskets.

I would treat this clustering result as a broad descriptive summary rather than a final customer segmentation model. One limitation is that k-means is being applied to binary purchase data, while k-means is usually better suited for continuous numeric variables. For this assignment, it is still useful as a simple extra-credit clustering method, but a distance-based method such as Jaccard distance could be more appropriate for basket data.

7 Conclusion

The market basket analysis found that the grocery data contains both broad staple-item patterns and more specific high-lift combinations. Whole milk, other vegetables, rolls/buns, soda, and yogurt are among the most frequent items overall. The top association rules by lift show stronger niche relationships, such as alcohol-related combinations, convenience-food combinations, sandwich-related combinations, and baking-related combinations.

The most important caution is that high lift does not always mean high business impact. Some of the highest-lift rules have low support, meaning they happen in only a small share of total baskets. A practical recommendation would be to use lift together with support and confidence. Rules with high lift and reasonable support would be the best candidates for store layout changes, promotions, or recommendation systems.

The train/test validation step makes the analysis closer to a predictive analytics workflow because the rules are not only discovered in one dataset but checked on unseen transactions. This helps identify whether a rule may generalize or whether it may only be strong because of rare combinations in the training data.

The extra-credit clustering analysis supports the same general conclusion: grocery baskets can be grouped into broad shopping patterns based on common staple items, but the most interesting detailed relationships are better captured by association rules.