In this third paper, I will perform Association Rules Analysis (also known as Market Basket Analysis) on a grocery store dataset. My objective is to use the Apriori algorithm to identify relationships between products that customers frequently buy together. By analyzing metrics such as Support, Confidence, and Lift, I will discover meaningful patterns in consumer behavior.
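For reference, the three metrics are conventionally defined as follows for a rule $X \Rightarrow Y$ over $N$ transactions. The notation ($N$, $\mathrm{count}$, $\mathrm{supp}$) is introduced here only for illustration; $\mathrm{count}(Z)$ is the number of transactions that contain every item in $Z$, and $\mathrm{supp}(Z) = \mathrm{count}(Z)/N$.

```latex
\begin{align*}
\mathrm{supp}(X \Rightarrow Y) &= \mathrm{supp}(X \cup Y) = \frac{\mathrm{count}(X \cup Y)}{N} \\
\mathrm{conf}(X \Rightarrow Y) &= \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}
\end{align*}
```

A lift of 1 corresponds to $X$ and $Y$ occurring independently of each other.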
# Loading required libraries
library(arules)
library(arulesViz)
library(tidyverse)
# Loading the built-in Groceries dataset
data("Groceries")
# Checking the structure of the data
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels level2 level1
## 1 frankfurter sausage meat and sausage
## 2 sausage sausage meat and sausage
## 3 liver loaf sausage meat and sausage
# We apply the Apriori algorithm.
# We want rules that appear in at least 0.1% of transactions (supp=0.001)
# and have at least 50% confidence (conf=0.5).
rules <- apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Let's see how many rules were generated
print(rules)
## set of 5668 rules
# Sorting the rules by 'lift' to find the most interesting patterns
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
# Showing the top 10 strongest rules
inspect(rules_sorted[1:10])
## lhs rhs support confidence coverage lift count
## [1] {Instant food products,
## soda} => {hamburger meat} 0.001220132 0.6315789 0.001931876 18.99565 12
## [2] {soda,
## popcorn} => {salty snack} 0.001220132 0.6315789 0.001931876 16.69779 12
## [3] {flour,
## baking powder} => {sugar} 0.001016777 0.5555556 0.001830198 16.40807 10
## [4] {ham,
## processed cheese} => {white bread} 0.001931876 0.6333333 0.003050330 15.04549 19
## [5] {whole milk,
## Instant food products} => {hamburger meat} 0.001525165 0.5000000 0.003050330 15.03823 15
## [6] {other vegetables,
## curd,
## yogurt,
## whipped/sour cream} => {cream cheese } 0.001016777 0.5882353 0.001728521 14.83409 10
## [7] {processed cheese,
## domestic eggs} => {white bread} 0.001118454 0.5238095 0.002135231 12.44364 11
## [8] {tropical fruit,
## other vegetables,
## yogurt,
## white bread} => {butter} 0.001016777 0.6666667 0.001525165 12.03058 10
## [9] {hamburger meat,
## yogurt,
## whipped/sour cream} => {butter} 0.001016777 0.6250000 0.001626843 11.27867 10
## [10] {tropical fruit,
## other vegetables,
## whole milk,
## yogurt,
## domestic eggs} => {butter} 0.001016777 0.6250000 0.001626843 11.27867 10
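In this step, I used the Apriori algorithm to generate association rules with a minimum support of 0.001 and a minimum confidence of 0.5, which produced 5,668 rules. To find the most meaningful relationships, I sorted the rules by lift. A lift greater than 1 means the items occur together more often than would be expected if they were purchased independently; the further above 1, the stronger the association. The ten highest-lift rules all have a lift above 11, but each rests on only 10 to 19 transactions, so they should be interpreted with some caution.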
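The metrics printed for the first rule can be re-derived by hand from the numbers in the output. The sketch below uses only base R and copies its inputs from the `summary()` and `inspect()` output above; the variable names are hypothetical and chosen for illustration.

```r
# Values copied from the output for rule [1]:
# {Instant food products, soda} => {hamburger meat}
n_transactions <- 9835         # rows reported by summary(Groceries)
count_rule     <- 12           # 'count' column: transactions containing lhs and rhs
coverage       <- 0.001931876  # 'coverage' column: support of the lhs alone
lift_printed   <- 18.99565     # 'lift' column

# Support: share of all transactions that contain every item in the rule
support <- count_rule / n_transactions

# Confidence: support of the whole rule divided by support of the lhs
confidence <- support / coverage

# Lift = confidence / support(rhs), so the rhs support implied by the output is:
rhs_support_implied <- confidence / lift_printed

round(c(support = support,
        confidence = confidence,
        rhs_support_implied = rhs_support_implied), 4)
```

Read this way, hamburger meat appears in roughly 3.3% of all baskets but in about 63% of the baskets that contain both Instant food products and soda, which is where the lift of about 19 comes from.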
# Scatter plot of rules to see Support vs Confidence
plot(rules, method = "scatter", measure = c("support", "confidence"), shading = "lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
This scatter plot shows all 5,668 generated rules. The x-axis (support) shows how often the items of a rule occur together, while the y-axis (confidence) shows how often the consequent appears when the antecedent is present. Darker points have a higher lift, indicating a stronger association between the products. Most of my rules have low support, which is common in large, sparse transaction datasets; confidence is at least 0.5 for every rule because of the threshold I set.
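The visual impression from the scatter plot can also be checked numerically. This is a minimal sketch, assuming the `arules` package is installed; it repeats the same `apriori()` call as above, with `verbose = FALSE` added only to keep the console output short, and `q` is a hypothetical variable name.

```r
library(arules)
data("Groceries")

# Same thresholds as in the analysis; verbose = FALSE only silences the log
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.5),
                 control   = list(verbose = FALSE))

# quality() returns a data frame with one row of measures per rule
q <- quality(rules)

# Distribution of the three measures shown in the scatter plot
summary(q[, c("support", "confidence", "lift")])

# Share of rules whose support is below 0.5% of all transactions
mean(q$support < 0.005)
```

The 0.5% cut-off in the last line is arbitrary and only meant to show how a statement like "most rules have low support" could be turned into a number.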
# Graph based visualization for the top 10 rules
plot(rules_sorted[1:10], method = "graph")
This network graph shows the top 10 rules based on their Lift values.
The circles (nodes) represent the rules, the labels are the items, and
the arrows show the direction of the association (from “if” to “then”).
For example, arrows run from “flour” and “baking powder” into a rule
node and from that node to “sugar”, which means customers buying flour
and baking powder are likely to buy sugar as well. The size of a
circle represents the support, and the color intensity represents the
lift.
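The information encoded in the graph can also be read as a table: the antecedent and consequent of each rule, together with the support and lift that the text above associates with node size and color. This is a sketch assuming `arules` is installed; `top10` and `rule_table` are hypothetical names.

```r
library(arules)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.5),
                 control   = list(verbose = FALSE))

# Same selection as in the graph: the ten rules with the highest lift
top10 <- sort(rules, by = "lift", decreasing = TRUE)[1:10]

# One row per rule: "if" items, "then" item, and the two plotted measures
rule_table <- data.frame(
  lhs     = labels(lhs(top10)),
  rhs     = labels(rhs(top10)),
  support = quality(top10)$support,
  lift    = quality(top10)$lift
)
rule_table
```

A table like this is often easier to cite in the text than the graph itself, since the exact items of each rule can be copied from it.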
In this third paper, I successfully performed a Market Basket Analysis on the Groceries dataset using the Apriori algorithm.
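Finding patterns: Whole milk and other vegetables, the two most frequent items in the dataset (2,513 and 1,903 transactions), are central to customer behavior and frequently appear as the “consequent” of rules. The ten highest-lift rules, by contrast, have more specific consequents such as hamburger meat, white bread, and butter.
Strategic Insights: High-lift rules such as {flour, baking powder} => {sugar} and {ham, processed cheese} => {white bread} point to product groups that the store could place near each other on the shelves or sell as bundle offers to increase sales.
Validation: The Scatter Plot and Network Graph show that the top rules combine high confidence with high lift. These visualizations are descriptive and do not by themselves establish statistical significance; the strongest rules rest on only 10 to 19 transactions each, so a formal test would be needed to confirm them.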
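Two of the points above can be checked directly rather than read off the plots: which items appear most often as consequents, and whether the rules pass a formal significance test. The sketch below is one possible way to do this, assuming `arules` is installed. The use of `is.significant()` with Fisher's exact test, `alpha = 0.01`, and a Bonferroni adjustment is chosen here for illustration and was not part of the analysis above; `rhs_counts` and `significant` are hypothetical names.

```r
library(arules)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.5),
                 control   = list(verbose = FALSE))

# How often each item appears as the consequent (rhs) across all rules
rhs_counts <- sort(table(labels(rhs(rules))), decreasing = TRUE)
head(rhs_counts, 5)

# Fisher's exact test per rule, adjusted for the number of rules tested
significant <- is.significant(rules, Groceries,
                              method = "fisher",
                              alpha  = 0.01,
                              adjust = "bonferroni")

# Number of rules that do and do not pass the adjusted test
table(significant)
```

With several thousand rules, an adjustment for multiple comparisons matters: without it, some rules would be expected to pass the test by chance alone.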
I utilized an AI Large Language Model as a technical assistant during the development of this project. The AI was used for the following purposes:
Syntax Support: Helping with the specific parameters of the apriori function and the arulesViz plotting methods.
Structural Guidance: Assistance in organizing the R Markdown report into logical sections (Data Prep, Analysis, Visualization).
Language Refinement: Ensuring the English explanations of technical metrics (Support, Confidence, Lift) are clear and academic.
All interpretations of the grocery store patterns and the final decision on rule thresholds were made by me as the author.