Grocery Market Basket Analysis in R

Overview

The Grocery Data Set contains 9,835 transactions, where each row represents a customer receipt and each column contains an item purchased during that transaction. This type of analysis is called Market Basket Analysis because it examines which products are frequently purchased together by customers.

The objectives of this assignment are to:

  • Perform association rule mining using R

  • Calculate and interpret support, confidence, and lift values

  • Identify the top 10 association rules based on lift

  • Conduct a simple cluster analysis for extra credit

R Code: Association Rule Mining

# Load libraries
library(arules)
## Warning: package 'arules' was built under R version 4.5.3
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 4.5.3
library(cluster)
## Warning: package 'cluster' was built under R version 4.5.3
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.5.3
## Loading required package: ggplot2
# Load grocery dataset
raw_data <- read.csv(
  "C:/Users/rbron/Downloads/GroceryDataSet.csv",
  header = FALSE,
  stringsAsFactors = FALSE
)

# Convert rows into transaction format
transactions_list <- apply(raw_data, 1, function(x) {
  # Remove empty values and NA values
  x <- x[x != "" & !is.na(x)]

  # Convert to character vector
  as.character(x)
})

# Convert to transactions object
transactions <- as(transactions_list, "transactions")

# Display summary of transactions
summary(transactions)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
# Inspect first five transactions
inspect(transactions[1:5])
##     items                      
## [1] {citrus fruit,             
##      margarine,                
##      ready soups,              
##      semi-finished bread}      
## [2] {coffee,                   
##      tropical fruit,           
##      yogurt}                   
## [3] {whole milk}               
## [4] {cream cheese ,            
##      meat spreads,             
##      pip fruit,                
##      yogurt}                   
## [5] {condensed milk,           
##      long life bakery product, 
##      other vegetables,         
##      whole milk}
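
As a quick consistency check on the summary output above, the reported density can be tied back to the mean basket size and the item counts. The sketch below is not part of the original analysis; it only re-uses numbers printed by summary(transactions), and the variable names are illustrative.

```r
# Values copied from the summary(transactions) output
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# Density is the share of filled cells in the transaction-by-item matrix,
# so density * number of items gives the mean number of items per receipt
mean_basket_size <- density * n_items

# density * rows * columns gives the total number of purchased items
total_items <- density * n_transactions * n_items

# Counts printed under "most frequent items", including the "(Other)" bucket
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)

print(round(mean_basket_size, 3))
print(round(total_items))
print(sum(item_counts))
```

Both checks agree with the printed summary: the mean basket size comes out at 4.409 items, and the item counts add up to the same total of 43,367 purchased items.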

Generate Association Rules

# Generate association rules using the Apriori algorithm
rules <- apriori(
  transactions,
  parameter = list(
    supp = 0.001,
    conf = 0.20,
    minlen = 2
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [21633 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
# Display summary of generated rules
summary(rules)
## set of 21633 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6 
##  620 9337 9824 1792   60 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.599   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001017   Min.   :0.2000   Min.   :0.001017   Min.   : 0.8028  
##  1st Qu.:0.001118   1st Qu.:0.2632   1st Qu.:0.002745   1st Qu.: 2.1178  
##  Median :0.001322   Median :0.3548   Median :0.004169   Median : 2.7571  
##  Mean   :0.001948   Mean   :0.3967   Mean   :0.005840   Mean   : 3.0214  
##  3rd Qu.:0.001932   3rd Qu.:0.5000   3rd Qu.:0.006101   3rd Qu.: 3.6148  
##  Max.   :0.074835   Max.   :1.0000   Max.   :0.255516   Max.   :35.7158  
##      count       
##  Min.   : 10.00  
##  1st Qu.: 11.00  
##  Median : 13.00  
##  Mean   : 19.15  
##  3rd Qu.: 19.00  
##  Max.   :736.00  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          9835   0.001        0.2
##                                                                                  call
##  apriori(data = transactions, parameter = list(supp = 0.001, conf = 0.2, minlen = 2))

Top 10 Rules by Lift

# Sort association rules by lift
rules_sorted <- sort(
  rules,
  by = "lift",
  decreasing = TRUE
)

# Display the top 10 rules
inspect(head(rules_sorted, 10))
##      lhs                                rhs                     support    
## [1]  {bottled beer, red/blush wine}  => {liquor}                0.001931876
## [2]  {hamburger meat, soda}          => {Instant food products} 0.001220132
## [3]  {ham, white bread}              => {processed cheese}      0.001931876
## [4]  {bottled beer, liquor}          => {red/blush wine}        0.001931876
## [5]  {Instant food products, soda}   => {hamburger meat}        0.001220132
## [6]  {curd, sugar}                   => {flour}                 0.001118454
## [7]  {baking powder, sugar}          => {flour}                 0.001016777
## [8]  {processed cheese, white bread} => {ham}                   0.001931876
## [9]  {fruit/vegetable juice, ham}    => {processed cheese}      0.001118454
## [10] {margarine, sugar}              => {flour}                 0.001626843
##      confidence coverage    lift     count
## [1]  0.3958333  0.004880529 35.71579 19   
## [2]  0.2105263  0.005795628 26.20919 12   
## [3]  0.3800000  0.005083884 22.92822 19   
## [4]  0.4130435  0.004677173 21.49356 19   
## [5]  0.6315789  0.001931876 18.99565 12   
## [6]  0.3235294  0.003457041 18.60767 11   
## [7]  0.3125000  0.003253686 17.97332 10   
## [8]  0.4634146  0.004168785 17.80345 19   
## [9]  0.2894737  0.003863752 17.46610 11   
## [10] 0.2962963  0.005490595 17.04137 16

Export Top Rules to Data Frame

# Convert association rules to a data frame
rules_df <- as(rules_sorted, "data.frame")

# Store the top 10 rules
top10_rules <- head(rules_df, 10)

# Display the top 10 rules
print(top10_rules)
##                                                  rules     support confidence
## 633          {bottled beer,red/blush wine} => {liquor} 0.001931876  0.3958333
## 696   {hamburger meat,soda} => {Instant food products} 0.001220132  0.2105263
## 1489           {ham,white bread} => {processed cheese} 0.001931876  0.3800000
## 632          {bottled beer,liquor} => {red/blush wine} 0.001931876  0.4130435
## 695   {Instant food products,soda} => {hamburger meat} 0.001220132  0.6315789
## 2022                           {curd,sugar} => {flour} 0.001118454  0.3235294
## 1916                  {baking powder,sugar} => {flour} 0.001016777  0.3125000
## 1488           {processed cheese,white bread} => {ham} 0.001931876  0.4634146
## 1492 {fruit/vegetable juice,ham} => {processed cheese} 0.001118454  0.2894737
## 2025                      {margarine,sugar} => {flour} 0.001626843  0.2962963
##         coverage     lift count
## 633  0.004880529 35.71579    19
## 696  0.005795628 26.20919    12
## 1489 0.005083884 22.92822    19
## 632  0.004677173 21.49356    19
## 695  0.001931876 18.99565    12
## 2022 0.003457041 18.60767    11
## 1916 0.003253686 17.97332    10
## 1488 0.004168785 17.80345    19
## 1492 0.003863752 17.46610    11
## 2025 0.005490595 17.04137    16

Visualize Association Rules

# Create a scatter plot of association rules
plot(
  rules_sorted,
  method = "scatterplot",
  measure = c("support", "confidence"),
  shading = "lift"
)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

# Create a graph visualization of the top 20 rules
plot(
  head(rules_sorted, 20),
  method = "graph"
)

Interpretation of Association Rule Metrics

Support

Support represents how often a combination of items appears within all transactions in the dataset. It is found by dividing the number of transactions that contain every item in the rule (left-hand and right-hand sides combined) by the total number of transactions. Higher support values indicate that the item combination occurs more frequently among customers.

The summary results showed a maximum support value of approximately 0.0748, meaning the most frequent rule appeared in about 7.5% of all grocery receipts (736 of 9,835). All ten of the highest-lift rules had support values between 0.001 and 0.002, which corresponds to only 10 to 19 receipts each, so these combinations are rare and their metrics rest on small counts.

For example, the rule:

{bottled beer, red/blush wine} => {liquor}

had a support value of 0.00193. This means that roughly 0.19% of all transactions included bottled beer, red/blush wine, and liquor together.
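
The support figure can be reproduced from the count column of the output. The following is a minimal sketch using only numbers reported above; the variable names are illustrative.

```r
n_transactions <- 9835  # total receipts, from summary(transactions)
rule_count     <- 19    # "count" for {bottled beer, red/blush wine} => {liquor}

# Support = receipts containing all items in the rule / total receipts
support <- rule_count / n_transactions
print(round(support, 9))

# The minimum support of 0.001 translates to 9.835 receipts, so a rule
# needs to appear on at least 10 receipts to pass the threshold
min_support  <- 0.001
min_receipts <- ceiling(min_support * n_transactions)
print(min_receipts)
```

The first value matches the reported support of 0.001931876, and the second is consistent with the minimum count of 10 shown in summary(rules).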

Confidence

Confidence measures the likelihood that a customer purchases item B when item A is already in the shopping basket. It is calculated by dividing the transactions containing both items by the transactions containing the item on the left-hand side of the rule.

The analysis used a minimum confidence threshold of 0.20, so in every generated rule at least 20% of the transactions containing the left-hand side also contained the right-hand side. Across the 21,633 generated rules, the average confidence value was about 0.397, while the highest confidence reached 1.00.

One strong example was the rule:

{Instant food products, soda} => {hamburger meat}

This rule had a confidence value of 0.632, meaning that about 63% of transactions containing instant food products and soda (12 of 19 receipts) also contained hamburger meat. This suggests a strong relationship between these products, although it rests on a small number of receipts.
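
Confidence can be recovered from two other columns in the rule output: the rule's support divided by its coverage (the support of the left-hand side alone). The sketch below assumes only the values printed for this rule; the variable names are illustrative.

```r
n_transactions <- 9835
rule_support   <- 0.001220132  # {Instant food products, soda} => {hamburger meat}
lhs_coverage   <- 0.001931876  # share of receipts with Instant food products and soda

# Confidence = support of the whole rule / support of the left-hand side
confidence <- rule_support / lhs_coverage
print(round(confidence, 3))

# The same ratio expressed in receipt counts
lhs_receipts  <- round(lhs_coverage * n_transactions)
rule_receipts <- 12  # "count" column for this rule
print(rule_receipts / lhs_receipts)
```

Both calculations give roughly 0.632, matching the reported confidence.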

Lift

Lift measures the strength of an association compared to what would be expected if the products were purchased independently. A lift value greater than 1 indicates that the items are bought together more often than expected by chance.

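The average lift value across all rules was approximately 3.02 and the median was 2.76, so a typical rule describes items bought together two to three times as often as independence would predict. The minimum lift was 0.80, which shows that a few rules passed the confidence threshold despite a slightly negative association. The highest lift value was 35.72, indicating an especially strong purchasing relationship.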

The strongest rule identified was:

{bottled beer, red/blush wine} => {liquor}

with a lift value of 35.72. This means that a transaction containing bottled beer and red/blush wine was about 35.7 times as likely to also contain liquor as a randomly chosen transaction.
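
Because lift equals confidence divided by the support of the right-hand side, the reported values can be used to back out how common liquor is overall. This is a rough sketch based only on the printed confidence and lift; the variable names are illustrative, and the liquor count is derived here rather than read from the output.

```r
n_transactions <- 9835
confidence     <- 0.3958333  # reported for {bottled beer, red/blush wine} => {liquor}
lift           <- 35.71579   # reported for the same rule

# lift = confidence / support(RHS), so support(RHS) = confidence / lift
rhs_support  <- confidence / lift
rhs_receipts <- rhs_support * n_transactions

print(round(rhs_support, 4))
print(round(rhs_receipts))
```

Under these figures, liquor appears on roughly 1.1% of all receipts (about 109), compared with about 39.6% of the receipts that contain both bottled beer and red/blush wine. The ratio between those two rates is the lift.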

Other strong association rules included:

  • {ham, white bread} => {processed cheese} with a lift of 22.93

  • {Instant food products, soda} => {hamburger meat} with a lift of 19.00

  • {processed cheese, white bread} => {ham} with a lift of 17.80

These rules suggest that customers often purchase related food products and meal ingredients together.

Visualization

The scatter plot displayed support on the x-axis and confidence on the y-axis, while darker colors represented higher lift values. Most rules appeared at low support levels, meaning many item combinations occurred infrequently. However, several rules showed high confidence and high lift, revealing strong purchasing relationships even when the combinations were less common.

The network graph further illustrated the strongest product associations. Alcohol-related items such as bottled beer, liquor, and red/blush wine formed one of the clearest clusters, reflecting very strong relationships between these products. Other noticeable connections involved ham, processed cheese, white bread, soda, hamburger meat, flour, sugar, and baking powder, indicating that shoppers commonly purchased complementary ingredients together.

Example Interpretation of Top Rules

After running the analysis, the rules with the highest lift showed the strongest relationships between products that customers purchased together.

Example Rule

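{yogurt, tropical fruit} => {whole milk}

Interpretation

A rule of this form says that customers who buy yogurt and tropical fruit are also likely to buy whole milk. This rule is used here only as an illustration; it is not among the top 10 rules by lift, so its metrics do not appear in the output above. If its lift is greater than 1, the relationship is stronger than random chance and represents a meaningful shopping pattern.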
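
To make the three metrics concrete for a rule of this shape, the following sketch computes them by hand in base R. The eight baskets are made up for illustration and are not rows from the Grocery Data Set; support_of is a hypothetical helper, not an arules function.

```r
# Made-up baskets for illustration only
baskets <- list(
  c("yogurt", "tropical fruit", "whole milk"),
  c("yogurt", "tropical fruit", "whole milk"),
  c("yogurt", "tropical fruit"),
  c("whole milk", "rolls/buns"),
  c("soda"),
  c("yogurt", "soda"),
  c("rolls/buns", "soda"),
  c("tropical fruit", "whole milk")
)

# Hypothetical helper: share of baskets that contain every item in `items`
support_of <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("yogurt", "tropical fruit")
rhs <- "whole milk"

rule_support <- support_of(c(lhs, rhs), baskets)          # all three items together
confidence   <- rule_support / support_of(lhs, baskets)   # P(rhs | lhs)
lift         <- confidence / support_of(rhs, baskets)     # relative to P(rhs)

print(c(support = rule_support, confidence = confidence, lift = lift))
```

In this toy data the rule has a support of 0.25, a confidence of about 0.67, and a lift of about 1.33, so whole milk is about a third more likely in baskets that already hold yogurt and tropical fruit than in baskets overall.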

Businesses can use this information to:

  • Place related products near each other in stores

  • Create product bundle promotions

  • Recommend related products online

  • Increase cross-selling opportunities

Extra Credit: Simple Cluster Analysis

Cluster analysis groups transactions with similar purchasing patterns.

Convert Transactions to Matrix and Run K-Means

# Convert sparse transaction data into a regular matrix
transaction_matrix <- as(transactions, "matrix")

# Run k-means clustering
set.seed(123)

clusters <- kmeans(
  transaction_matrix,
  centers = 3
)

# Display the first few cluster assignments
head(clusters$cluster)
## [1] 1 1 3 1 3 3
# Display the size of each cluster
clusters$size
## [1] 6554 1085 2196

Visualize Clusters

# Visualize the k-means clusters
fviz_cluster(
  clusters,
  data = transaction_matrix
)

Cluster Analysis Interpretation

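The k-means cluster analysis grouped the 9,835 grocery transactions into three segments based on similar basket contents. Because each row is a receipt rather than a customer, the clusters describe types of shopping trips.

The cluster sizes were:

  • Cluster 1: 6,554 transactions (about 66.6%)

  • Cluster 2: 1,085 transactions (about 11.0%)

  • Cluster 3: 2,196 transactions (about 22.3%)

Cluster 1 was the largest group, showing that about two-thirds of transactions had similar purchasing patterns. Cluster 2 was the smallest and may represent baskets with more unusual combinations of items. Cluster 3 represented another distinct group of transactions.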
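
One way to describe what distinguishes each cluster is to look at the cluster centers. For a 0/1 purchase matrix, each center value is the share of that cluster's receipts containing the item. The sketch below illustrates this on a small made-up matrix rather than the grocery data; the item names and purchase probabilities are assumptions chosen for the example.

```r
# Made-up 0/1 purchase matrix: rows are receipts, columns are items
set.seed(123)
toy_matrix <- rbind(
  matrix(rbinom(40 * 4, 1, c(0.9, 0.8, 0.1, 0.1)), ncol = 4, byrow = TRUE),
  matrix(rbinom(40 * 4, 1, c(0.1, 0.1, 0.9, 0.8)), ncol = 4, byrow = TRUE)
)
colnames(toy_matrix) <- c("whole milk", "yogurt", "bottled beer", "liquor")

# Several random starts make the result less dependent on the initial centers
toy_clusters <- kmeans(toy_matrix, centers = 2, nstart = 10)

# Each center is the per-item purchase rate within that cluster
print(round(toy_clusters$centers, 2))
print(toy_clusters$size)

# The same profile computed directly from the cluster assignments
manual_centers <- rowsum(toy_matrix, toy_clusters$cluster) /
  as.vector(table(toy_clusters$cluster))
print(round(manual_centers, 2))
```

Applied to the grocery clusters, the same idea would mean sorting each row of clusters$centers to see which items are most common in each segment.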

The cluster plot showed the three groups in different colors, with the transactions projected onto the first two principal components. Some overlap existed between the groups, but the plot still showed visible differences in purchasing patterns.

Businesses can use these results to improve marketing, product placement, and customer recommendations.

Conclusion

This project used market basket analysis to explore customer purchasing patterns within the Grocery Data Set using R. By applying the Apriori algorithm, the analysis identified products that were frequently purchased together and measured the strength of these relationships using support, confidence, and lift.

The results showed several strong associations between grocery items, especially among products that are commonly used together or purchased as part of the same meal. Visualizations such as scatter plots and network graphs helped display these relationships more clearly.

The cluster analysis also divided the transactions into three groups based on similar basket contents, suggesting that different kinds of shopping trips follow different purchasing patterns.

Overall, the analysis demonstrated how association rule mining and clustering can provide valuable business insights. Companies can use these findings to improve marketing strategies, organize store layouts, recommend related products, and increase sales through cross-selling opportunities.