Grocery Market Basket Analysis + Clustering

Introduction

This assignment uses a grocery transaction dataset. Each row represents one customer receipt (transaction), and each column in a row holds one item purchased in that transaction. The goal is to use R to mine the data for association rules, report support, confidence, and lift, and list the top 10 rules by lift. A simple cluster analysis of the items follows.

Load the Dataset

groceries_raw <- read.csv("GroceryDataSet.csv",
                          header = FALSE,
                          stringsAsFactors = FALSE)

dim(groceries_raw)
## [1] 9835   32
head(groceries_raw)
##                 V1                  V2             V3                       V4
## 1     citrus fruit semi-finished bread      margarine              ready soups
## 2   tropical fruit              yogurt         coffee                         
## 3       whole milk                                                            
## 4        pip fruit              yogurt  cream cheese              meat spreads
## 5 other vegetables          whole milk condensed milk long life bakery product
## 6       whole milk              butter         yogurt                     rice
##                 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
## 1                                                                             
## 2                                                                             
## 3                                                                             
## 4                                                                             
## 5                                                                             
## 6 abrasive cleaner                                                            
##   V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32
## 1                                            
## 2                                            
## 3                                            
## 4                                            
## 5                                            
## 6
summary(groceries_raw)
##       V1                 V2                 V3                 V4           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##       V5                 V6                 V7                 V8           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##       V9                V10                V11                V12           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      V13                V14                V15                V16           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      V17                V18                V19                V20           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      V21                V22                V23                V24           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      V25                V26                V27                V28           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      V29                V30                V31                V32           
##  Length:9835        Length:9835        Length:9835        Length:9835       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character

Data Cleaning

Blank cells in this dataset are not errors or missing data. Receipts contain different numbers of items, so shorter transactions are padded with empty strings up to the 32-column width of the file.

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
groceries_raw[groceries_raw == ""] <- NA

# Remove extra spaces from item names
groceries_raw[] <- lapply(groceries_raw, function(x) trimws(x))

# Convert each row into a transaction list
groceries_list <- apply(groceries_raw, 1, function(x) {
  x <- na.omit(x)
  x <- x[x != ""]
  unique(x)
})

# Convert the list of item vectors to an arules transactions object
groceries_trans <- as(groceries_list, "transactions")

summary(groceries_trans)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
inspect(groceries_trans[1:5])
##     items                      
## [1] {citrus fruit,             
##      margarine,                
##      ready soups,              
##      semi-finished bread}      
## [2] {coffee,                   
##      tropical fruit,           
##      yogurt}                   
## [3] {whole milk}               
## [4] {cream cheese,             
##      meat spreads,             
##      pip fruit,                
##      yogurt}                   
## [5] {condensed milk,           
##      long life bakery product, 
##      other vegetables,         
##      whole milk}
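
As a quick consistency check on the summary above, the density, the mean basket size, and the item counts should all agree with one another. The sketch below uses only figures copied from the printed summary; the variable names are illustrative and not part of the original analysis.

```r
# Figures copied from summary(groceries_trans) above
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Item occurrences: the five most frequent items plus "(Other)"
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)
total_items <- sum(item_counts)

# Mean basket size computed two ways
mean_from_counts  <- total_items / n_transactions  # occurrences per transaction
mean_from_density <- density * n_items             # filled share of 169 columns

round(c(from_counts = mean_from_counts, from_density = mean_from_density), 3)
```

Both values round to the reported mean of 4.409 items per transaction, which suggests that the conversion to transactions kept every item occurrence.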

Item Frequency Exploration

itemFrequencyPlot(groceries_trans,
                  topN = 20,
                  type = "absolute",
                  main = "Top 20 Most Frequently Purchased Items",
                  col = "lightgreen")

# Mining the data for association rules

# First model: moderate support/confidence
rules_1 <- apriori(
  groceries_trans,
  parameter = list(
    supp = 0.001,
    conf = 0.20,
    minlen = 2
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [21633 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(rules_1)
## set of 21633 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6 
##  620 9337 9824 1792   60 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.599   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001017   Min.   :0.2000   Min.   :0.001017   Min.   : 0.8028  
##  1st Qu.:0.001118   1st Qu.:0.2632   1st Qu.:0.002745   1st Qu.: 2.1178  
##  Median :0.001322   Median :0.3548   Median :0.004169   Median : 2.7571  
##  Mean   :0.001948   Mean   :0.3967   Mean   :0.005840   Mean   : 3.0214  
##  3rd Qu.:0.001932   3rd Qu.:0.5000   3rd Qu.:0.006101   3rd Qu.: 3.6148  
##  Max.   :0.074835   Max.   :1.0000   Max.   :0.255516   Max.   :35.7158  
##      count       
##  Min.   : 10.00  
##  1st Qu.: 11.00  
##  Median : 13.00  
##  Mean   : 19.15  
##  3rd Qu.: 19.00  
##  Max.   :736.00  
## 
## mining info:
##             data ntransactions support confidence
##  groceries_trans          9835   0.001        0.2
##                                                                                     call
##  apriori(data = groceries_trans, parameter = list(supp = 0.001, conf = 0.2, minlen = 2))
# Second model: stricter support and confidence
rules_2 <- apriori(
  groceries_trans,
  parameter = list(
    supp = 0.002,
    conf = 0.30,
    minlen = 2
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.002      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 19 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [3119 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(rules_2)
## set of 3119 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5 
##  181 1863 1001   74 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.00    3.00    3.31    4.00    5.00 
## 
## summary of quality measures:
##     support           confidence        coverage             lift       
##  Min.   :0.002034   Min.   :0.3000   Min.   :0.002440   Min.   : 1.176  
##  1st Qu.:0.002339   1st Qu.:0.3607   1st Qu.:0.005084   1st Qu.: 1.957  
##  Median :0.002847   Median :0.4369   Median :0.006711   Median : 2.356  
##  Mean   :0.003931   Mean   :0.4558   Mean   :0.009361   Mean   : 2.517  
##  3rd Qu.:0.003965   3rd Qu.:0.5355   3rd Qu.:0.009558   3rd Qu.: 2.904  
##  Max.   :0.074835   Max.   :0.8857   Max.   :0.193493   Max.   :11.421  
##      count       
##  Min.   : 20.00  
##  1st Qu.: 23.00  
##  Median : 28.00  
##  Mean   : 38.66  
##  3rd Qu.: 39.00  
##  Max.   :736.00  
## 
## mining info:
##             data ntransactions support confidence
##  groceries_trans          9835   0.002        0.3
##                                                                                     call
##  apriori(data = groceries_trans, parameter = list(supp = 0.002, conf = 0.3, minlen = 2))
# Third model: lower support to discover less frequent but possibly strong rules
rules_3 <- apriori(
  groceries_trans,
  parameter = list(
    supp = 0.0005,
    conf = 0.20,
    minlen = 2
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   5e-04      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [164 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.01s].
## writing ... [111652 rule(s)] done [0.01s].
## creating S4 object  ... done [0.01s].
summary(rules_3)
## set of 111652 rules
## 
## rule length distribution (lhs + rhs):sizes
##     2     3     4     5     6     7 
##   768 22681 55303 27954  4680   266 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   4.000   4.124   5.000   7.000 
## 
## summary of quality measures:
##     support            confidence        coverage              lift        
##  Min.   :0.0005084   Min.   :0.2000   Min.   :0.0005084   Min.   : 0.7827  
##  1st Qu.:0.0005084   1st Qu.:0.2857   1st Qu.:0.0011185   1st Qu.: 2.6091  
##  Median :0.0006101   Median :0.4000   Median :0.0017285   Median : 3.6494  
##  Mean   :0.0008728   Mean   :0.4546   Mean   :0.0024302   Mean   : 4.4219  
##  3rd Qu.:0.0008134   3rd Qu.:0.5833   3rd Qu.:0.0025419   3rd Qu.: 5.1982  
##  Max.   :0.0748348   Max.   :1.0000   Max.   :0.2555160   Max.   :81.9583  
##      count        
##  Min.   :  5.000  
##  1st Qu.:  5.000  
##  Median :  6.000  
##  Mean   :  8.584  
##  3rd Qu.:  8.000  
##  Max.   :736.000  
## 
## mining info:
##             data ntransactions support confidence
##  groceries_trans          9835   5e-04        0.2
##                                                                                     call
##  apriori(data = groceries_trans, parameter = list(supp = 5e-04, conf = 0.2, minlen = 2))

Association rules were generated using the Apriori algorithm to identify relationships between items in the dataset. The strength of these rules is evaluated using three key measures: support, confidence, and lift.

Support is the proportion of transactions that contain every item in a rule. In this analysis, support values are low: the mean rule support ranges from about 0.0009 to 0.004 across the three parameter settings (about 0.002 for the first model), so most item combinations occur in well under 1% of transactions.
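
To make the support figures concrete, the following base R sketch converts the three minimum support thresholds into transaction counts and converts a count back into a support value. The transaction total and thresholds come from the output above; the variable names are illustrative.

```r
n_transactions <- 9835

# Minimum support thresholds used for rules_1, rules_2, and rules_3
min_support <- c(rules_1 = 0.001, rules_2 = 0.002, rules_3 = 0.0005)

# Smallest whole number of transactions that meets each threshold
min_count <- ceiling(min_support * n_transactions)
min_count

# Support of a rule observed in 19 transactions
support_19 <- 19 / n_transactions
round(support_19, 9)
```

The counts come out as 10, 20, and 5, matching the minimum `count` values in the three rule summaries. The "Absolute minimum support count" lines printed by `apriori` (9, 19, and 4) appear to be the same products rounded down instead.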

Confidence is the conditional probability that a transaction contains the right-hand side of a rule given that it contains the left-hand side. The average confidence across rules is approximately 0.40–0.45, meaning that when the left-hand side of a rule occurs, the right-hand side also occurs about 40–45% of the time.
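
Confidence can be recovered from two other reported measures, because coverage is the support of the left-hand side alone. The sketch below checks this for the first three rules, using values copied from the `inspect(top10_rules)` output later in this document; the variable names are illustrative.

```r
# Support and coverage of the first three top-10 rules (copied from the output)
support  <- c(0.001931876, 0.001220132, 0.001931876)
coverage <- c(0.004880529, 0.005795628, 0.005083884)

# Confidence is the share of left-hand-side transactions that also contain the right-hand side
confidence <- support / coverage
round(confidence, 4)
```

The results (0.3958, 0.2105, 0.3800) agree with the confidence column reported for those rules.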

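Lift compares the observed co-occurrence of the two sides of a rule with what would be expected if they were independent; it equals confidence divided by the support of the right-hand side. The average lift is greater than 1 (about 2.5 to 4.4 depending on the parameters), indicating that many item combinations occur more often than independence would predict. The minimum lift is below 1 in the first and third models (0.80 and 0.78), so a few rules describe items that co-occur less often than expected. Some rules have very high lift (up to 35.7 in the first model and 82.0 in the third), suggesting strong associations between certain items.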
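
The three measures can be computed by hand on a small example. The baskets below are hypothetical and invented purely for illustration, and `support_of` is an illustrative helper, not an arules function.

```r
# Hypothetical baskets, invented for illustration only
baskets <- list(
  c("beer", "wine", "liquor"),
  c("beer", "wine"),
  c("milk", "bread"),
  c("milk"),
  c("beer", "wine", "liquor"),
  c("bread"),
  c("milk", "bread", "beer"),
  c("milk")
)

# Proportion of baskets that contain every item in `items`
support_of <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("beer", "wine")
rhs <- "liquor"

rule_support    <- support_of(c(lhs, rhs), baskets)          # P(LHS and RHS)
rule_confidence <- rule_support / support_of(lhs, baskets)   # P(RHS | LHS)
rule_lift       <- rule_confidence / support_of(rhs, baskets) # vs. P(RHS) overall

c(support = rule_support, confidence = rule_confidence, lift = rule_lift)
```

Here liquor appears in 2 of the 3 baskets that contain beer and wine, but in only 2 of all 8 baskets, so the lift is (2/3) / (2/8), or about 2.67.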

Overall, while most rules have low support, the confidence and lift values indicate meaningful relationships between items in the dataset.

Top 10 Rules by Lift

rules_lift <- sort(rules_1, by = "lift", decreasing = TRUE)

top10_rules <- head(rules_lift, 10)

inspect(top10_rules)
##      lhs                                rhs                     support    
## [1]  {bottled beer, red/blush wine}  => {liquor}                0.001931876
## [2]  {hamburger meat, soda}          => {Instant food products} 0.001220132
## [3]  {ham, white bread}              => {processed cheese}      0.001931876
## [4]  {bottled beer, liquor}          => {red/blush wine}        0.001931876
## [5]  {Instant food products, soda}   => {hamburger meat}        0.001220132
## [6]  {curd, sugar}                   => {flour}                 0.001118454
## [7]  {baking powder, sugar}          => {flour}                 0.001016777
## [8]  {processed cheese, white bread} => {ham}                   0.001931876
## [9]  {fruit/vegetable juice, ham}    => {processed cheese}      0.001118454
## [10] {margarine, sugar}              => {flour}                 0.001626843
##      confidence coverage    lift     count
## [1]  0.3958333  0.004880529 35.71579 19   
## [2]  0.2105263  0.005795628 26.20919 12   
## [3]  0.3800000  0.005083884 22.92822 19   
## [4]  0.4130435  0.004677173 21.49356 19   
## [5]  0.6315789  0.001931876 18.99565 12   
## [6]  0.3235294  0.003457041 18.60767 11   
## [7]  0.3125000  0.003253686 17.97332 10   
## [8]  0.4634146  0.004168785 17.80345 19   
## [9]  0.2894737  0.003863752 17.46610 11   
## [10] 0.2962963  0.005490595 17.04137 16
# Table with support, confidence, and lift
top10_table <- as(top10_rules, "data.frame")

top10_table <- top10_table[, c("rules", "support", "confidence", "lift")]

print(top10_table)
##                                                  rules     support confidence
## 633          {bottled beer,red/blush wine} => {liquor} 0.001931876  0.3958333
## 696   {hamburger meat,soda} => {Instant food products} 0.001220132  0.2105263
## 1489           {ham,white bread} => {processed cheese} 0.001931876  0.3800000
## 632          {bottled beer,liquor} => {red/blush wine} 0.001931876  0.4130435
## 695   {Instant food products,soda} => {hamburger meat} 0.001220132  0.6315789
## 2022                           {curd,sugar} => {flour} 0.001118454  0.3235294
## 1916                  {baking powder,sugar} => {flour} 0.001016777  0.3125000
## 1488           {processed cheese,white bread} => {ham} 0.001931876  0.4634146
## 1492 {fruit/vegetable juice,ham} => {processed cheese} 0.001118454  0.2894737
## 2025                      {margarine,sugar} => {flour} 0.001626843  0.2962963
##          lift
## 633  35.71579
## 696  26.20919
## 1489 22.92822
## 632  21.49356
## 695  18.99565
## 2022 18.60767
## 1916 17.97332
## 1488 17.80345
## 1492 17.46610
## 2025 17.04137
# Save the top 10 rules to a CSV file
write.csv(top10_table,
          "top10_grocery_rules_by_lift.csv",
          row.names = FALSE)

The top 10 association rules were selected based on the highest lift values, indicating the strongest relationships between items.

The strongest rule shows that customers who purchase bottled beer and red/blush wine are far more likely than average to also purchase liquor, with a lift of 35.72. Liquor appears in 39.6% of the baskets that contain both items (confidence = 0.40), nearly 36 times the rate expected if the purchases were independent, although the rule rests on only 19 transactions.

Another strong rule indicates that customers who buy hamburger meat and soda are likely to also purchase instant food products (lift = 26.21). This suggests a pattern of convenience or quick-meal purchases.

The rule {ham, white bread} → {processed cheese} (lift = 22.93) reflects a common food pairing, suggesting customers are purchasing ingredients for sandwiches.

Similarly, customers who buy instant food products and soda are highly likely to also purchase hamburger meat (confidence = 0.63), indicating a strong likelihood of these items being purchased together.

Several rules involve baking-related items. For example, combinations such as {curd, sugar}, {baking powder, sugar}, and {margarine, sugar} all lead to flour, suggesting that these items are commonly purchased together for baking purposes.

Overall, while the support values are low (indicating these combinations are relatively rare), the high lift values show that these item combinations are significantly more likely to occur together than by chance. This highlights meaningful purchasing patterns and associations within the dataset.

These findings demonstrate how association rule mining can uncover meaningful consumer purchasing behavior.
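
Several of the top 10 rules are rearrangements of the same items. Rules 1 and 4, for example, both involve bottled beer, red/blush wine, and liquor, which is why they share the same support and count. The following base R sketch counts how many distinct itemsets lie behind the ten rules, using the left- and right-hand sides copied from the `inspect()` output above; the variable names are illustrative.

```r
# LHS and RHS of the top 10 rules, copied from the inspect() output above
lhs <- list(
  c("bottled beer", "red/blush wine"),
  c("hamburger meat", "soda"),
  c("ham", "white bread"),
  c("bottled beer", "liquor"),
  c("Instant food products", "soda"),
  c("curd", "sugar"),
  c("baking powder", "sugar"),
  c("processed cheese", "white bread"),
  c("fruit/vegetable juice", "ham"),
  c("margarine", "sugar")
)
rhs <- c("liquor", "Instant food products", "processed cheese",
         "red/blush wine", "hamburger meat", "flour", "flour",
         "ham", "processed cheese", "flour")

# Canonical key for each rule's full itemset: sorted items joined by " + "
itemset_key <- mapply(
  function(l, r) paste(sort(c(l, r)), collapse = " + "),
  lhs, rhs
)

n_distinct_itemsets <- length(unique(itemset_key))
n_distinct_itemsets

# Itemsets that appear in more than one rule
names(which(table(itemset_key) > 1))
```

The ten rules come from seven distinct itemsets, and three of those itemsets each produce two rules, so the list describes fewer independent patterns than its length suggests.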

Visualize Association Rules

library(arulesViz)

plot(top10_rules,
     method = "scatter",
     measure = c("support", "confidence"),
     shading = "lift")

plot(top10_rules,
     method = "graph",
     engine = "htmlwidget")

This scatter plot displays the top 10 association rules, with support on the x-axis and confidence on the y-axis. Each point represents a rule, and the color intensity indicates the lift value, where darker red corresponds to stronger associations.

The color scale shows that all ten rules have high lift values, from about 17 to nearly 36, with one rule exceeding 30. These item combinations occur far more frequently than expected by chance.

Overall, the plot shows that these rules are rare (support between about 0.001 and 0.002, or 10 to 19 transactions each) and have moderate confidence (0.21 to 0.63), yet very high lift. They are useful leads for understanding customer purchasing behavior, but the small counts mean they should be interpreted with caution.

Simple Cluster Analysis

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Convert transactions to a binary item matrix
binary_matrix <- as(groceries_trans, "matrix")

# Top 30 most frequent items
item_freq <- itemFrequency(groceries_trans, type = "absolute")
top_items <- names(sort(item_freq, decreasing = TRUE))[1:30]

top_binary_matrix <- binary_matrix[, top_items]

# Transpose so that each row is an item and each column is a transaction
item_matrix <- t(top_binary_matrix)

# Compute distance between items
dist_matrix <- dist(item_matrix, method = "binary")

# Hierarchical clustering
hc <- hclust(dist_matrix, method = "complete")

# Plot the dendrogram
plot(hc,
     main = "Hierarchical Clustering of Top 30 Grocery Items",
     xlab = "",
     sub = "",
     cex = 0.7)

# Cut the tree into 5 clusters
clusters <- cutree(hc, k = 5)

clusters
##            whole milk      other vegetables            rolls/buns 
##                     1                     1                     2 
##                  soda                yogurt         bottled water 
##                     3                     1                     3 
##       root vegetables        tropical fruit         shopping bags 
##                     1                     1                     2 
##               sausage                pastry          citrus fruit 
##                     2                     2                     1 
##          bottled beer            newspapers           canned beer 
##                     4                     3                     5 
##             pip fruit fruit/vegetable juice    whipped/sour cream 
##                     1                     3                     1 
##           brown bread         domestic eggs           frankfurter 
##                     2                     1                     2 
##             margarine                coffee                  pork 
##                     1                     4                     1 
##                butter                  curd                  beef 
##                     1                     1                     1 
##               napkins             chocolate     frozen vegetables 
##                     3                     3                     1
# Cluster table
cluster_table <- data.frame(
  item = names(clusters),
  cluster = clusters
)

cluster_table <- cluster_table %>%
  arrange(cluster, item)

print(cluster_table)
##                                        item cluster
## beef                                   beef       1
## butter                               butter       1
## citrus fruit                   citrus fruit       1
## curd                                   curd       1
## domestic eggs                 domestic eggs       1
## frozen vegetables         frozen vegetables       1
## margarine                         margarine       1
## other vegetables           other vegetables       1
## pip fruit                         pip fruit       1
## pork                                   pork       1
## root vegetables             root vegetables       1
## tropical fruit               tropical fruit       1
## whipped/sour cream       whipped/sour cream       1
## whole milk                       whole milk       1
## yogurt                               yogurt       1
## brown bread                     brown bread       2
## frankfurter                     frankfurter       2
## pastry                               pastry       2
## rolls/buns                       rolls/buns       2
## sausage                             sausage       2
## shopping bags                 shopping bags       2
## bottled water                 bottled water       3
## chocolate                         chocolate       3
## fruit/vegetable juice fruit/vegetable juice       3
## napkins                             napkins       3
## newspapers                       newspapers       3
## soda                                   soda       3
## bottled beer                   bottled beer       4
## coffee                               coffee       4
## canned beer                     canned beer       5

Hierarchical clustering was used to group the 30 most frequently purchased grocery items by similarity in purchasing patterns. Similarity is measured with the binary (Jaccard) distance between items, which is small when two items tend to appear in the same transactions. The dendrogram shows how the items are grouped, with items that merge lower on the tree being more similar in how often they are purchased together.
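
The `"binary"` method of `dist()` is the Jaccard distance: among the transactions that contain at least one of two items, it is the proportion that contain only one of them. The sketch below compares it with a hand calculation on a small hypothetical matrix; `toy_items` and `jaccard_dist` are illustrative names, not part of the analysis above.

```r
# Hypothetical item-by-transaction matrix (1 = item is in that transaction)
toy_items <- rbind(
  milk  = c(1, 1, 0, 1, 0, 1),
  bread = c(1, 0, 0, 1, 1, 0),
  beer  = c(0, 0, 1, 0, 1, 0)
)

# Jaccard distance by hand: 1 - (transactions with both) / (transactions with either)
jaccard_dist <- function(a, b) 1 - sum(a & b) / sum(a | b)

d_builtin <- as.matrix(dist(toy_items, method = "binary"))
d_manual  <- jaccard_dist(toy_items["milk", ], toy_items["bread", ])

c(builtin = d_builtin["milk", "bread"], manual = d_manual)
```

Milk and bread share 2 of the 5 transactions in which either appears, so both calculations give 0.6. Milk and beer never appear together, so their distance is 1.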

The clustering results show several interpretable groupings. For example, brown bread, rolls/buns, pastry, sausage, and frankfurter are grouped together (along with shopping bags), suggesting they are commonly purchased as part of meal or breakfast combinations. Another cluster includes beverages such as bottled water, soda, and fruit/vegetable juice, along with chocolate, napkins, and newspapers, indicating that these items are often purchased together.

Bottled beer and coffee form a small cluster of their own, while canned beer is the only item in its cluster, indicating that it does not strongly co-occur with the other top 30 items. The largest cluster contains 15 items, including whole milk, yogurt, other vegetables, fruit, meat, and dairy products, which reflects typical fresh-food baskets.
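
How complete linkage and `cutree()` turn distances into cluster labels can be seen on a very small example. The pairwise distances below are hypothetical and chosen only for illustration; `toy_hc` and `toy_clusters` are illustrative names.

```r
# Hypothetical pairwise distances between four items
items <- c("whole milk", "yogurt", "soda", "bottled water")
d <- matrix(0, 4, 4, dimnames = list(items, items))
d["whole milk", "yogurt"]        <- 0.20
d["soda", "bottled water"]       <- 0.30
d["whole milk", "soda"]          <- 0.90
d["whole milk", "bottled water"] <- 0.80
d["yogurt", "soda"]              <- 0.70
d["yogurt", "bottled water"]     <- 0.95
d <- d + t(d)  # fill in the lower triangle so the matrix is symmetric

# Complete linkage: the distance between two groups is that of their farthest pair
toy_hc <- hclust(as.dist(d), method = "complete")

# Cut the tree into two clusters, in the same way as cutree(hc, k = 5) above
toy_clusters <- cutree(toy_hc, k = 2)
toy_clusters
toy_hc$height
```

The two closest pairs merge first (at heights 0.20 and 0.30), and the resulting groups join only at 0.95, the largest distance between any of their members. Cutting at two clusters therefore separates the dairy items from the drinks. An item that sits far from all others, like canned beer in the analysis above, stays alone until late in the tree.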

Overall, the clustering analysis complements the association rules by identifying groups of similar items, while the association rules identify directional relationships between specific item combinations. Together, these methods provide a deeper understanding of customer purchasing behavior.

Conclusion

This analysis used market basket analysis to identify relationships between grocery items that are frequently purchased together. The Apriori algorithm generated association rules, and the top 10 rules were ranked by lift. Lift was particularly useful because it highlights relationships that occur more frequently than expected by chance, allowing for the identification of strong and meaningful item associations.

The data were examined for missing values, and blank entries were treated as the absence of items in a transaction rather than missing data. Therefore, no imputation was required, as each transaction simply contained a different number of purchased items.

The results revealed clear patterns in consumer purchasing behavior, such as common combinations of convenience foods, beverages, and baking-related items. These insights can be valuable for retail strategies, including product placement, promotions, and inventory management.

The hierarchical clustering analysis further supported these findings by grouping similar items based on purchasing patterns. While association rules identify directional relationships between items, clustering highlights broader groupings of related products. Together, these methods provide a more comprehensive understanding of customer behavior.

Overall, this analysis demonstrates how data mining techniques can uncover meaningful patterns in transaction data and support data-driven decision-making in retail environments.