DATA624 - HW10

Overview

This assignment applies Market Basket Analysis to identify patterns in customer purchasing behavior using a dataset of grocery transactions. Each transaction represents a receipt containing items purchased together, mimicking a real-world checkout basket. The goal is to discover association rules that reveal which products are frequently bought together.

Using the Apriori algorithm from the ‘arules’ package in R, we mined the dataset for rules that meet minimum thresholds of support and confidence. These rules help uncover relationships between products, such as “Customers who buy instant food products and soda often buy hamburger meat.” We then evaluated these rules using lift, which measures the strength of the association compared to random chance.

The top 10 rules by lift were analyzed and visualized to gain insight into customer buying habits. An optional cluster analysis was also performed to explore further groupings among products or transactions. To support this, we applied Multidimensional Scaling (MDS) to visualize how item clusters relate in two-dimensional space.

Some Quick Definitions

Market Basket Analysis: Finding what items are often bought together.
Association Rules: “If X, then Y” patterns in data.
Apriori Algorithm: A method to find frequent itemsets and rules efficiently.
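Support: The share of all transactions that contain every item in a rule (or itemset).
Confidence: Of the transactions that contain X, the share that also contain Y.
Lift: Confidence divided by the overall share of transactions that contain Y; values above 1 mean Y is more likely when X is present than by chance.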
Cluster Analysis: Grouping similar data points (or transactions/items) together.
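
To make the three rule metrics concrete, here is a minimal base-R sketch on five made-up baskets. The baskets and the helper name support_of are illustrative assumptions and are not taken from the grocery data.

```r
# Five hypothetical baskets, invented purely for illustration
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk", "soda"),
  c("bread", "butter", "soda")
)

# Fraction of baskets that contain every item in `items`
support_of <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule under study: {bread} => {butter}
supp_rule <- support_of(c("bread", "butter"), baskets)  # P(X and Y)
supp_lhs  <- support_of("bread", baskets)               # P(X)
supp_rhs  <- support_of("butter", baskets)              # P(Y)

conf_rule <- supp_rule / supp_lhs   # P(Y | X)
lift_rule <- conf_rule / supp_rhs   # P(Y | X) / P(Y)

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

Here the rule has support 0.6, confidence 0.75, and lift 1.25. A lift above 1 means butter shows up more often in bread baskets than in baskets overall; a lift of exactly 1 would indicate independence.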

Load Required Library and Grocery Dataset

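We begin by loading arules along with supporting packages for visualization (arulesViz, igraph, ggplot2), clustering (cluster), and data manipulation (dplyr). We then import the dataset using the read.transactions() function, which stores the baskets as a sparse transaction-by-item matrix suitable for rule mining.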

library(arules)
library(arulesViz)
library(igraph)
library(cluster)
library(dplyr)
library(ggplot2)

transactions <- read.transactions(file = "GroceryDataSet.csv", format = "basket", sep = ",")
summary(transactions)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

Summary Insights:

9835 transactions (i.e., receipts)
169 unique items
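Density ~2.6% (so about 97.4% of the transaction-by-item matrix is empty), which means most transactions contain only a few of the 169 items (as expected in basket data); the mean basket size is about 4.4 items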
Top frequent items: whole milk - 2513, other vegetables - 1903, rolls/buns - 1809, soda - 1715, yogurt - 1372.
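
The density figure ties the other summary numbers together. The short check below is a sketch that uses only values copied from the summary() output above; the variable names are illustrative.

```r
# Figures copied from the summary() output
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Density is the share of filled cells in the transaction-by-item matrix
total_cells  <- n_transactions * n_items
filled_cells <- density * total_cells   # total number of item occurrences

# Cross-check against the item frequencies listed in the summary
listed_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top 5 items + (Other)

# Average number of items per basket (equivalently density * n_items)
mean_basket_size <- filled_cells / n_transactions

print(round(filled_cells))
print(sum(listed_counts))
print(round(mean_basket_size, 3))
```

Both totals come out to 43367 item occurrences, and the mean basket size of 4.409 matches the Mean reported in the length distribution.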

Generate Association Rules Using Apriori

Next, we apply the Apriori algorithm with a minimum support of 0.001 and minimum confidence of 0.5 to generate meaningful rules. We then sort the rules by lift, a measure of how much stronger the rule is compared to chance.

rules <- apriori(transactions, parameter = list(supp = 0.001, conf = 0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
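
The relative support threshold can be translated into an absolute number of baskets. A minimal check, using only the transaction count and the supp value shown above (variable names are illustrative):

```r
n_transactions <- 9835    # from the summary() output
min_support    <- 0.001   # supp passed to apriori()

raw_threshold <- min_support * n_transactions  # 9.835 baskets
min_count     <- ceiling(raw_threshold)        # smallest whole count with support >= 0.001

print(raw_threshold)
print(min_count)
print(min_count / n_transactions)  # support of a rule seen in exactly min_count baskets
```

A rule therefore needs at least 10 baskets to qualify, which corresponds to a support of about 0.001017. The "Absolute minimum support count: 9" line in the log looks like 9.835 rounded down; a rule seen in only 9 baskets would have a support of about 0.000915, below the threshold.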

rules_lift <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_lift, 10))

##      lhs                         rhs                  support confidence    coverage     lift count
## [1]  {Instant food products,                                                                       
##       soda}                   => {hamburger meat} 0.001220132  0.6315789 0.001931876 18.99565    12
## [2]  {popcorn,                                                                                     
##       soda}                   => {salty snack}    0.001220132  0.6315789 0.001931876 16.69779    12
## [3]  {baking powder,                                                                               
##       flour}                  => {sugar}          0.001016777  0.5555556 0.001830198 16.40807    10
## [4]  {ham,                                                                                         
##       processed cheese}       => {white bread}    0.001931876  0.6333333 0.003050330 15.04549    19
## [5]  {Instant food products,                                                                       
##       whole milk}             => {hamburger meat} 0.001525165  0.5000000 0.003050330 15.03823    15
## [6]  {curd,                                                                                        
##       other vegetables,                                                                            
##       whipped/sour cream,                                                                          
##       yogurt}                 => {cream cheese}   0.001016777  0.5882353 0.001728521 14.83409    10
## [7]  {domestic eggs,                                                                               
##       processed cheese}       => {white bread}    0.001118454  0.5238095 0.002135231 12.44364    11
## [8]  {other vegetables,                                                                            
##       tropical fruit,                                                                              
##       white bread,                                                                                 
##       yogurt}                 => {butter}         0.001016777  0.6666667 0.001525165 12.03058    10
## [9]  {hamburger meat,                                                                              
##       whipped/sour cream,                                                                          
##       yogurt}                 => {butter}         0.001016777  0.6250000 0.001626843 11.27867    10
## [10] {domestic eggs,                                                                               
##       other vegetables,                                                                            
##       tropical fruit,                                                                              
##       whole milk,                                                                                  
##       yogurt}                 => {butter}         0.001016777  0.6250000 0.001626843 11.27867    10

Analysis of Top 10 Rules by Lift:

These top 10 rules reveal strong associations between product combinations. Below is a summary of a few key insights:

Rule 1: {Instant food products, soda} => {hamburger meat}: With a confidence of 63.2% and a lift of 19.0, this rule indicates that customers who purchase both instant food products and soda are about 19 times more likely to buy hamburger meat than customers overall.
Rule 2: {popcorn, soda} => {salty snack}: This shows a similar pattern with the same confidence (63.2%) and a strong lift (16.70), reflecting a likely snack-buying behavior.
Rule 4: {ham, processed cheese} => {white bread}: Suggests sandwich ingredients are often purchased together.
Rules 8-10: These combine dairy and produce items with other staples (white bread, domestic eggs, hamburger meat) on the left-hand side and have butter as the consequent, suggesting that butter tends to appear in larger baskets of everyday meal ingredients.

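Overall, the rules with high lift suggest potential cross-selling opportunities, such as grouping complementary items in store layouts or marketing campaigns. Because each of these rules is supported by only 10 to 19 of the 9835 transactions, they are best treated as leads to test rather than firm conclusions.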
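
The metrics for Rule 1 can be re-derived from the counts in the table, which also shows what a lift of about 19 means in terms of baskets. This is a sketch using only figures from the inspect() output and the transaction count; the variable names are illustrative.

```r
n_transactions <- 9835   # from the summary() output

# Rule 1 figures copied from the inspect() table
rule_count <- 12          # baskets with Instant food products, soda AND hamburger meat
coverage   <- 0.001931876 # share of baskets containing the left-hand side
lift       <- 18.99565

lhs_count  <- round(coverage * n_transactions)  # baskets with Instant food products and soda
support    <- rule_count / n_transactions
confidence <- rule_count / lhs_count

# lift = confidence / support(rhs), so the table implies the baseline rate of the rhs
rhs_support <- confidence / lift

# Co-occurrences expected if the left- and right-hand sides were independent
expected_count <- lhs_count * rhs_support

print(c(lhs_count = lhs_count, support = support, confidence = confidence))
print(c(rhs_support = rhs_support, expected_count = expected_count))
```

The left-hand side appears in 19 baskets, 12 of which also contain hamburger meat (12/19 = 63.2%). Since hamburger meat appears in only about 3.3% of all baskets, fewer than one of those 19 baskets would be expected to contain it under independence, against the 12 observed.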

Visualization

Plot Analysis

This plot shows the top 10 association rules by lift, drawn with a fixed random seed (seed = 42) so that the layout is reproducible. Each node (circle) represents a grocery item, with node size reflecting support (frequency in transactions) and color intensity representing lift (strength of association). Arrows show the direction of rules, such as {Instant food products, soda} → {hamburger meat}. The layout makes it easy to identify clusters of commonly co-purchased items.

Note: While the graph is slightly dense, it effectively communicates the top 10 association rules using node size (support), node color (lift), and arrows (rule direction). Label spacing could be improved with further customization, but this version preserves all relevant metrics and provides an accurate summary.
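
The plotting call itself is not shown in this write-up. One way a graph like this could be drawn is sketched below; it is an assumption about a reasonable approach, not necessarily the code behind the figure. It uses the Groceries transactions bundled with arules as a stand-in for GroceryDataSet.csv (the bundled data also has 9835 transactions and 169 items), the graph method of arulesViz with the igraph engine, and verbose = FALSE only to silence the progress log.

```r
library(arules)
library(arulesViz)

# Stand-in data (assumption): the Groceries transactions shipped with arules
data("Groceries", package = "arules")

# Same thresholds as in the text
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.5),
                 control = list(verbose = FALSE))

# arules sorts in decreasing order by default
top10 <- head(sort(rules, by = "lift"), 10)

# Graph layouts are randomized, so fix the seed for a reproducible picture
set.seed(42)
plot(top10, method = "graph", engine = "igraph")
```

With the arulesViz defaults, size is mapped to support and color shading to lift, which matches the encoding described above.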

In addition to finding specific item-to-item rules, we also wanted to explore broader patterns of item similarity. This is where cluster analysis offers a complementary perspective.

Cluster Analysis of Frequently Purchased Items

Beyond analyzing individual association rules, it’s also valuable to understand how products are generally grouped based on purchase behavior. For this purpose, we performed a cluster analysis on the most frequently purchased items using hierarchical clustering.

Since the transaction data is stored as a sparse matrix, we first converted it into a binary format where each item was marked as present or absent in each transaction. To reduce noise and focus on more meaningful groupings, we limited our clustering to items that appeared in more than 100 transactions.

The items were then clustered with Ward’s method (ward.D2) on Euclidean distances between their binary purchase vectors, and the resulting tree was cut into four clusters. While one cluster contained the majority of items, a few smaller clusters emerged, reflecting more specialized item groupings.

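# Dense logical matrix: rows = transactions, columns = items
item_matrix <- as(transactions, "matrix")
# Transpose so that each row is an item (the objects to be clustered)
item_matrix_t <- t(item_matrix)
# Keep items that appear in more than 100 transactions
item_matrix_top <- item_matrix_t[rowSums(item_matrix_t) > 100, ]

# Euclidean distance between the items' binary purchase vectors
dissim <- dist(item_matrix_top, method = "euclidean")

# Agglomerative clustering with Ward's minimum-variance criterion
hc <- hclust(dissim, method = "ward.D2")

# Cut the dendrogram into 4 clusters and record each item's cluster label
clusters <- cutree(hc, k = 4)
clustered_items <- data.frame(
  Item = rownames(item_matrix_top),
  Cluster = clusters
)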
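
One property of this setup is worth keeping in mind when reading the clusters. For 0/1 item vectors, the squared Euclidean distance between two items equals the number of transactions that contain exactly one of them (n_A + n_B - 2 * n_AB), so it grows with item frequency and not only with a lack of co-occurrence. The toy sketch below uses made-up data and hypothetical item names to illustrate this, and contrasts it with the Jaccard-type distance that base R provides as method = "binary". It is an aside under these assumptions, not part of the analysis above.

```r
# Toy item-by-transaction 0/1 matrix (rows = items), invented for illustration
items <- rbind(
  popular_a = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0),
  popular_b = c(1, 1, 1, 1, 1, 0, 1, 0, 0, 0),  # shares 5 baskets with popular_a
  rare_a    = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
  rare_b    = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0)   # never bought with rare_a
)

d_euclid  <- as.matrix(dist(items, method = "euclidean"))
d_jaccard <- as.matrix(dist(items, method = "binary"))

# Squared Euclidean distance = n_A + n_B - 2 * n_AB for binary vectors
n_a  <- sum(items["popular_a", ])
n_b  <- sum(items["popular_b", ])
n_ab <- sum(items["popular_a", ] * items["popular_b", ])
print(c(formula   = n_a + n_b - 2 * n_ab,
        from_dist = d_euclid["popular_a", "popular_b"]^2))

# Euclidean distance: the rare pair is as close as the popular pair
print(c(popular_pair = d_euclid["popular_a", "popular_b"],
        rare_pair    = d_euclid["rare_a", "rare_b"]))

# Jaccard-type distance: the popular pair is close, the rare pair is maximally far
print(c(popular_pair = d_jaccard["popular_a", "popular_b"],
        rare_pair    = d_jaccard["rare_a", "rare_b"]))
```

In this toy case the two rare items, which never share a basket, are exactly as close under Euclidean distance as the two popular items that share five baskets, while the Jaccard-type distance separates the two cases (2/7 versus 1). A large cluster of lower-frequency items can therefore appear under Euclidean distance even without strong co-purchasing.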

Bar Plot of Item Clusters

The bar chart below summarizes the number of items in each cluster after grouping by co-purchase similarity:

# Create summary of cluster counts
cluster_counts <- clustered_items %>%
  count(Cluster) %>%
  arrange(desc(n)) %>%
  mutate(Cluster = factor(Cluster, levels = Cluster))

# Bar chart of cluster sizes (bars follow the count order set by the factor levels above)
ggplot(cluster_counts, aes(x = Cluster, y = n)) +
  geom_col(fill = "pink") +
  geom_text(aes(label = n), vjust = -0.4, color = "black") +
  labs(
    title = "Number of Items per Cluster",
    subtitle = "Clusters formed by item co-purchase similarity",
    x = "Cluster",
    y = "Number of Items"
  ) +
  theme_minimal(base_size = 13)

Plot Analysis

This chart shows how the frequently purchased items were distributed across the four clusters produced by hierarchical clustering. Cluster 1 contains the vast majority of items. This may reflect shared purchase behavior among those products, but with Euclidean distance on binary data it can also arise simply because items with similarly low purchase counts sit close together. The smaller clusters represent more niche groupings, potentially revealing specific product combinations or customer preferences. While the bar chart summarizes cluster sizes, the following MDS plot helps visualize how clusters relate spatially.

Multidimensional Projection (MDS):

To further explore how item clusters relate to one another in two-dimensional space, we used Multidimensional Scaling (MDS). This technique maps the pairwise item dissimilarities to points in a 2D space while preserving the relative distances as closely as possible.

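# Classical MDS on the item distance matrix, keeping 2 dimensions
mds_coords <- cmdscale(dissim, k = 2)
# data.frame() names the two unnamed coordinate columns X1 and X2
mds_df <- data.frame(mds_coords, Cluster = as.factor(clusters))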

ggplot(mds_df, aes(x = X1, y = X2, color = Cluster)) +
  geom_point(size = 3) +
  labs(title = "MDS Projection of Item Clusters", x = "Dim 1", y = "Dim 2") +
  theme_minimal()

Plot Analysis

The resulting scatter plot offers a visual representation of item similarity based on co-purchase behavior. Most items fall into a single dense cluster (Cluster 1), while a few items form smaller, more distinct groupings, confirming the patterns seen in the bar plot.
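
How faithfully a 2D map represents the original distances can be checked with the goodness-of-fit value that cmdscale() returns when eig = TRUE. The sketch below uses made-up points (not the grocery items) to show the mechanics; for the analysis above, the analogous call would be cmdscale(dissim, k = 2, eig = TRUE)$GOF, whose value is not reported here.

```r
# Made-up data: 6 points in 5 dimensions, invented for illustration
set.seed(1)
pts <- matrix(rnorm(30), nrow = 6, ncol = 5)
d   <- dist(pts, method = "euclidean")

# eig = TRUE makes cmdscale() return a list with coordinates and fit statistics
mds <- cmdscale(d, k = 2, eig = TRUE)

coords <- mds$points  # 6 x 2 matrix of projected coordinates
gof    <- mds$GOF     # share of total variation captured by the 2 kept dimensions

# Compare the original distances with the distances in the 2D projection
d_2d <- dist(coords)
print(round(gof, 3))
print(round(cor(as.vector(d), as.vector(d_2d)), 3))
```

A goodness-of-fit value close to 1 means the two plotted dimensions capture most of the structure in the distance matrix; a low value means that apparent overlap or separation in the scatter plot should be read with caution.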

Conclusion

This assignment gave me the opportunity to apply Market Basket Analysis using the Apriori algorithm to uncover patterns in grocery transaction data. I learned how to generate association rules and interpret key metrics like support, confidence, and lift to identify items that are frequently purchased together. The top 10 rules revealed strong relationships between products, such as baskets with instant food products and soda being far more likely than average to include hamburger meat.

Visualizing these rules as a network graph helped me see how products are interconnected, with node size and color conveying the frequency and strength of associations. Although the graph was a bit dense, it effectively highlighted key item relationships.

As part of the extra credit, I also explored cluster analysis. Although the clusters were more challenging to interpret, I learned how to group items based on their purchase patterns. Most items fell into a single dominant cluster, which may point to common co-purchase patterns. The bar plot helped summarize cluster sizes, while the MDS projection offered a visual interpretation of how these clusters relate in two-dimensional space.

Overall, this project helped me better understand how data mining techniques like association rule mining and clustering can uncover meaningful insights from real-world transaction data.
