Introduction

In this third paper, I perform association rule analysis (also known as market basket analysis) on a grocery store dataset. My objective is to use the Apriori algorithm to identify products that customers frequently buy together. I then use three metrics, support, confidence, and lift, to rank the resulting rules and pick out meaningful patterns in consumer behavior.

# Loading required libraries
library(arules)
library(arulesViz)
library(tidyverse)

# Loading the built-in Groceries dataset
data("Groceries")

# Checking the structure of the data
summary(Groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

# Apply the Apriori algorithm.
# Keep rules that appear in at least 0.1% of transactions (supp = 0.001)
# and have at least 50% confidence (conf = 0.5).
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

# Check how many rules were generated
print(rules)
## set of 5668 rules

# Sorting the rules by 'lift' to find the most interesting patterns
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)

# Showing the top 10 strongest rules
inspect(rules_sorted[1:10])
##      lhs                         rhs                  support confidence    coverage     lift count
## [1]  {Instant food products,                                                                       
##       soda}                   => {hamburger meat} 0.001220132  0.6315789 0.001931876 18.99565    12
## [2]  {soda,                                                                                        
##       popcorn}                => {salty snack}    0.001220132  0.6315789 0.001931876 16.69779    12
## [3]  {flour,                                                                                       
##       baking powder}          => {sugar}          0.001016777  0.5555556 0.001830198 16.40807    10
## [4]  {ham,                                                                                         
##       processed cheese}       => {white bread}    0.001931876  0.6333333 0.003050330 15.04549    19
## [5]  {whole milk,                                                                                  
##       Instant food products}  => {hamburger meat} 0.001525165  0.5000000 0.003050330 15.03823    15
## [6]  {other vegetables,                                                                            
##       curd,                                                                                        
##       yogurt,                                                                                      
##       whipped/sour cream}     => {cream cheese }  0.001016777  0.5882353 0.001728521 14.83409    10
## [7]  {processed cheese,                                                                            
##       domestic eggs}          => {white bread}    0.001118454  0.5238095 0.002135231 12.44364    11
## [8]  {tropical fruit,                                                                              
##       other vegetables,                                                                            
##       yogurt,                                                                                      
##       white bread}            => {butter}         0.001016777  0.6666667 0.001525165 12.03058    10
## [9]  {hamburger meat,                                                                              
##       yogurt,                                                                                      
##       whipped/sour cream}     => {butter}         0.001016777  0.6250000 0.001626843 11.27867    10
## [10] {tropical fruit,                                                                              
##       other vegetables,                                                                            
##       whole milk,                                                                                  
##       yogurt,                                                                                      
##       domestic eggs}          => {butter}         0.001016777  0.6250000 0.001626843 11.27867    10

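In this step, I used the Apriori algorithm to generate association rules with a minimum support of 0.001 and a minimum confidence of 0.5, which produced 5,668 rules. To find the most meaningful relationships, I sorted the rules by lift. A lift greater than 1 means the items appear together more often than would be expected if they were bought independently, and the further the value is above 1, the stronger the association. The top rule, {Instant food products, soda} => {hamburger meat}, has a lift of about 19, but it rests on only 12 transactions, so rules with such small counts should be read with caution.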
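As a minimal sketch, the three metrics for rule [3], {flour, baking powder} => {sugar}, can be recomputed by hand from the values printed above. The variable names are my own, and the support of {sugar} is not printed in the output, so it is inferred here from the reported lift rather than read from the data.

```r
# Values copied from the printed output for rule [3]: {flour, baking powder} => {sugar}
n_transactions <- 9835          # rows reported by summary(Groceries)
count_rule     <- 10            # transactions containing all three items
coverage       <- 0.001830198   # support of the left-hand side {flour, baking powder}
lift_printed   <- 16.40807      # lift reported by inspect()

# Support: share of all transactions that contain every item in the rule
support <- count_rule / n_transactions

# Confidence: share of left-hand-side transactions that also contain the right-hand side
count_lhs  <- round(coverage * n_transactions)
confidence <- count_rule / count_lhs

# Lift = confidence / support(rhs), so the printed lift implies the support of {sugar}
support_rhs_implied <- confidence / lift_printed

round(c(support = support,
        confidence = confidence,
        support_rhs = support_rhs_implied), 6)
```

The recomputed support and confidence agree with the printed 0.001016777 and 0.5555556. The implied support of {sugar} corresponds to roughly 333 of the 9,835 transactions, which is why a confidence of about 0.56 translates into a lift above 16.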

# Scatter plot of rules to see Support vs Confidence
plot(rules, method = "scatter", measure = c("support", "confidence"), shading = "lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

This scatter plot visualizes all generated rules. The x-axis (support) shows how often the full rule occurs across all transactions, while the y-axis (confidence) shows how often the right-hand side appears in transactions that contain the left-hand side. Darker points represent higher lift. Every rule has a confidence of at least 0.5 because of the threshold I set, and most rules have very low support, which is common when a low support threshold is applied to a large transaction dataset.
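The spread of the rules can also be checked numerically instead of being read off the plot. The sketch below assumes the same thresholds as above and rebuilds `rules` so that it runs on its own. The support cut-off of 0.002 is an arbitrary value chosen for illustration, not one used in the analysis.

```r
library(arules)
data("Groceries")

# Rebuild the rule set with the thresholds used above (progress output silenced)
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.5),
                 control = list(verbose = FALSE))

# quality() returns the data frame of interest measures attached to the rules
q <- quality(rules)

# Summary statistics for the two plotted measures and for lift
summary(q[, c("support", "confidence", "lift")])

# Share of rules below the illustrative support cut-off of 0.002
share_low_support <- mean(q$support < 0.002)
share_low_support
```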

# Graph based visualization for the top 10 rules
plot(rules_sorted[1:10], method = "graph")

This network graph shows the top 10 rules based on their lift values. The circles (nodes) represent the rules, and the item labels are connected to them by arrows that show the direction of the association: arrows run from the "if" items into a rule node, and from the rule node to the "then" item. For example, arrows run from "flour" and "baking powder" into one rule node and from that node to "sugar", which means customers who buy flour and baking powder are likely to buy sugar as well. The size of each circle represents the support, and the color intensity represents the lift.
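To read the same structure as text, the rules that point to a single item in the graph can be listed directly. This is a sketch under the same thresholds as above. The choice of "butter" as the right-hand-side item is only an illustration, and `top_rules` and `butter_rules` are names introduced here.

```r
library(arules)
data("Groceries")

# Rebuild the rule set with the thresholds used above (progress output silenced)
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.5),
                 control = list(verbose = FALSE))

# Top 10 rules by lift, the same selection that is drawn in the graph
top_rules <- head(sort(rules, by = "lift"), 10)

# Keep only the rules whose right-hand side ("then" part) is butter
butter_rules <- subset(top_rules, subset = rhs %in% "butter")

# Each listed rule corresponds to one rule node with an arrow into "butter"
inspect(butter_rules)
```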

Conclusion

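In this third paper, I performed a market basket analysis on the Groceries dataset using the Apriori algorithm. With a minimum support of 0.001 and a minimum confidence of 0.5, the algorithm generated 5,668 rules from 9,835 transactions. The rules with the highest lift, such as {Instant food products, soda} => {hamburger meat} and {flour, baking powder} => {sugar}, link products that are plausibly bought for the same purpose, although each of them is supported by fewer than 20 transactions.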

AI Usage Statement

I utilized an AI Large Language Model as a technical assistant during the development of this project. The AI was used for the following purposes:

All interpretations of the grocery store patterns and the final decision on rule thresholds were made by me as the author.