Introduction

The modern retail landscape relies heavily on data-driven insights to optimize operations and enhance customer satisfaction. One powerful analytical technique is association rule mining, which helps identify patterns in customer purchasing behavior. By uncovering relationships between products frequently bought together, businesses can improve their marketing strategies, optimize store layouts, and increase sales through effective cross-selling.

In this project, we applied association rule mining to a market dataset to discover hidden patterns in transaction data. We utilized the Apriori algorithm, a popular technique in data mining, to generate association rules and interpret their significance. This report details our methodology, findings, and actionable business recommendations derived from the analysis.

Data Preparation

Dataset Overview

The dataset used in this analysis consists of 464 transactions and 22 distinct items typically found in a grocery store, including products like Milk, Bread, Butter, Eggs, and more. Each row in the dataset represents a transaction, while each column corresponds to a specific product. A value of 1 indicates the item was purchased in that transaction, while 0 indicates it was not.

Preprocessing Steps

Before conducting association rule mining, we performed several preprocessing steps to ensure the data was in the correct format for analysis. This involved converting the binary data to a logical format and transforming the dataset into a transaction class using the arules package.

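# Load arules (provides the transactions class, apriori(), and inspect())
library(arules)

# Load dataset (semicolon-separated file)
market_data <- read.csv("market.csv", sep = ";", dec = ".")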

# Convert dataset to logical format and transactions
market_data_logical <- as.data.frame(lapply(market_data, function(x) as.logical(as.integer(x))))
transactions <- as(market_data_logical, "transactions")

# Inspect transactions
inspect(transactions[1:5])
##     items          transactionID
## [1] {Bread,                     
##      Bacon,                     
##      Banana,                    
##      Apple,                     
##      Hazelnut,                  
##      Carrot,                    
##      HeavyCream,                
##      Egg,                       
##      Sugar}                    1
## [2] {Bread,                     
##      Honey,                     
##      Bacon,                     
##      Banana,                    
##      Apple,                     
##      Hazelnut,                  
##      Cucumber,                  
##      Milk,                      
##      Butter,                    
##      Flour,                     
##      Olive,                     
##      Shampoo}                  2
## [3] {Honey,                     
##      Bacon,                     
##      Toothpaste,                
##      Banana,                    
##      Apple,                     
##      Hazelnut,                  
##      Cheese,                    
##      Meat,                      
##      Cucumber,                  
##      Onion,                     
##      Milk,                      
##      ShavingFoam,               
##      Salt,                      
##      Flour,                     
##      HeavyCream,                
##      Egg,                       
##      Sugar}                    3
## [4] {Bread,                     
##      Honey,                     
##      Toothpaste,                
##      Apple,                     
##      Cucumber,                  
##      Onion,                     
##      Milk,                      
##      Flour,                     
##      Egg,                       
##      Olive,                     
##      Shampoo}                  4
## [5] {Honey}                    5
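
The conversion above assumes every column holds only 0 and 1. A minimal sketch of a check that could run before the conversion is shown here; `check_binary` is a hypothetical helper (not part of arules), and the toy data frame is made up for illustration rather than taken from market.csv.

```r
# Hypothetical helper: return the names of columns that contain NA or any
# value other than 0/1. as.logical(as.integer(x)) would otherwise turn any
# non-zero value into TRUE without complaint.
check_binary <- function(df) {
  bad <- vapply(df,
                function(x) any(is.na(x)) || !all(x %in% c(0, 1)),
                logical(1))
  names(df)[bad]
}

# Toy stand-in for market.csv (values are invented for this example)
toy <- data.frame(Bread = c(1, 0, 1),
                  Milk  = c(0, 1, 2),   # 2 is not a valid indicator
                  Egg   = c(1, NA, 0))  # NA is not a valid indicator

bad_cols <- check_binary(toy)
print(bad_cols)
```

An empty result means the data frame is safe to convert. Any column names returned should be inspected first.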

Exploratory Data Analysis (EDA)

Before applying the Apriori algorithm, it is important to explore the dataset to understand item frequency and general transaction patterns. This initial exploration provides insights into which products are most commonly purchased and sets expectations for the association rules.

# Plot item frequency
itemFrequencyPlot(transactions, topN = 10, type = "absolute", main = "Top 10 Purchased Items")

# Summary of transactions
summary(transactions)
## transactions as itemMatrix in sparse format with
##  464 rows (elements/itemsets/transactions) and
##  22 columns (items) and a density of 0.3993926 
## 
## most frequent items:
##   Banana   Cheese    Bacon Hazelnut    Honey  (Other) 
##      208      206      200      195      193     3075 
## 
## element (itemset/transaction) length distribution:
## sizes
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 
## 19 22 11 30 33 28 25 35 37 45 42 43 41 27 18  5  3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   9.000   8.787  12.000  17.000 
## 
## includes extended item information - examples:
##   labels variables levels
## 1  Bread     Bread   TRUE
## 2  Honey     Honey   TRUE
## 3  Bacon     Bacon   TRUE
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
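
The figures in this summary are tied together by simple arithmetic, which gives a quick consistency check. The sketch below re-derives the mean basket size and the density from the printed length distribution, using only numbers copied from the output above.

```r
# Basket sizes and how many transactions have each size (from summary())
sizes  <- 1:17
counts <- c(19, 22, 11, 30, 33, 28, 25, 35, 37, 45, 42, 43, 41, 27, 18, 5, 3)

n_tx    <- sum(counts)   # number of transactions
n_items <- 22            # number of distinct items

# Total number of item purchases across all baskets
total_purchases <- sum(sizes * counts)

# Mean basket size and matrix density (share of cells that are TRUE)
mean_size <- total_purchases / n_tx
density   <- total_purchases / (n_tx * n_items)

# The "most frequent items" row should add up to the same total
freq_total <- 208 + 206 + 200 + 195 + 193 + 3075

cat(n_tx, total_purchases, round(mean_size, 3), round(density, 7), "\n")
```

Density multiplied by the number of items gives the mean basket size, so a density of about 0.40 over 22 items corresponds to just under nine items per basket.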

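Interpretation:

  • The five most frequent items are Banana (208 transactions), Cheese (206), Bacon (200), Hazelnut (195), and Honey (193). Each appears in roughly 42% to 45% of the 464 baskets, so no single product dominates.
  • Baskets are large relative to the catalogue: the mean basket holds 8.8 of the 22 items (median 9, range 1 to 17), which corresponds to the reported density of about 0.40.
  • Because baskets are this dense, many item combinations can be expected to clear a low support threshold, so the initial rule set is likely to be large.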

Generating Association Rules

With a foundational understanding of the data, we applied the Apriori algorithm to uncover relationships between items. The algorithm identifies frequent itemsets and generates association rules that meet specified minimum thresholds for support and confidence. We then ranked the resulting rules by lift.

Key Metrics:

  • Support: The share of all transactions that contain every item in the rule (LHS and RHS together). Higher support means the pattern is present in more transactions.
  • Confidence: The share of transactions containing the LHS that also contain the RHS. Higher confidence indicates a stronger predictive relationship.
  • Lift: The rule's confidence divided by the overall share of transactions that contain the RHS. A lift greater than 1 means the LHS and RHS occur together more often than would be expected if they were independent.
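
To make the three metrics concrete, here is a minimal base-R sketch that computes them by hand for a single rule. The five-transaction matrix is invented for illustration and is not drawn from the market dataset.

```r
# Toy incidence matrix: 5 transactions x 3 items (made-up values)
toy <- cbind(
  Bread  = c(1, 1, 0, 1, 1),
  Butter = c(1, 1, 0, 0, 1),
  Milk   = c(1, 0, 1, 1, 1)
) == 1

# Rule under test: {Bread, Butter} => {Milk}
lhs <- toy[, "Bread"] & toy[, "Butter"]  # transactions containing the whole LHS
rhs <- toy[, "Milk"]                     # transactions containing the RHS

support    <- mean(lhs & rhs)             # share of all transactions with LHS and RHS
confidence <- sum(lhs & rhs) / sum(lhs)   # share of LHS transactions that also have RHS
lift       <- confidence / mean(rhs)      # confidence relative to the RHS base rate

cat(support, confidence, lift, "\n")
```

In this toy case the lift is below 1: Milk is slightly less common in baskets with Bread and Butter than in baskets overall, even though the confidence looks moderately high on its own.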

Initial Rule Generation

We started with moderate thresholds to capture a broad range of rules:

# Initial parameters
rules <- apriori(transactions, parameter = list(supp = 0.05, conf = 0.6, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 23 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[22 item(s), 464 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [2597 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(sort(rules, by = "lift")[1:10])
##      lhs                                       rhs      support    confidence
## [1]  {Honey, Bacon, Carrot, Egg}            => {Meat}   0.05603448 0.8125000 
## [2]  {Bacon, Banana, Butter, Egg}           => {Cheese} 0.05387931 0.9259259 
## [3]  {Bacon, Meat, Carrot, Onion}           => {Honey}  0.05387931 0.8620690 
## [4]  {Honey, Banana, Meat, Onion}           => {Bacon}  0.05172414 0.8888889 
## [5]  {Bacon, Meat, Salt}                    => {Sugar}  0.06465517 0.7500000 
## [6]  {Banana, Cheese, Butter, ShavingFoam}  => {Bacon}  0.06034483 0.8750000 
## [7]  {Honey, Bacon, Carrot, Onion}          => {Meat}   0.05387931 0.7812500 
## [8]  {Bacon, Hazelnut, Cheese, ShavingFoam} => {Butter} 0.05818966 0.7500000 
## [9]  {Honey, Meat, Carrot, Onion}           => {Bacon}  0.05387931 0.8620690 
## [10] {Toothpaste, Hazelnut, Shampoo}        => {Butter} 0.05603448 0.7428571 
##      coverage   lift     count
## [1]  0.06896552 2.094444 26   
## [2]  0.05818966 2.085581 25   
## [3]  0.06250000 2.072539 25   
## [4]  0.05818966 2.062222 24   
## [5]  0.08620690 2.047059 30   
## [6]  0.06896552 2.030000 28   
## [7]  0.06896552 2.013889 25   
## [8]  0.07758621 2.000000 27   
## [9]  0.06250000 2.000000 25   
## [10] 0.07543103 1.980952 26
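
The lift and count columns can be cross-checked against the item frequencies reported in the transaction summary. The sketch below does this for rules [2] and [4], using only figures printed in the outputs above (Cheese in 206 and Bacon in 200 of the 464 transactions).

```r
n_tx <- 464

# Item counts from summary(transactions)
cheese_count <- 206
bacon_count  <- 200

# lift = confidence / support(RHS)
lift_rule2 <- 0.9259259 / (cheese_count / n_tx)  # {Bacon, Banana, Butter, Egg} => {Cheese}
lift_rule4 <- 0.8888889 / (bacon_count / n_tx)   # {Honey, Banana, Meat, Onion} => {Bacon}

# count = support * number of transactions
count_rule2 <- round(0.05387931 * n_tx)

cat(round(lift_rule2, 6), round(lift_rule4, 6), count_rule2, "\n")
```

Both recomputed lifts agree with the printed values (2.085581 and 2.062222), and the recomputed count for rule [2] is 25. Note how small these counts are: each of the top rules rests on only 24 to 30 transactions.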

Refining the Rules

To focus on more significant associations, we raised the minimum support from 0.05 to 0.07 and the minimum confidence from 0.6 to 0.75. This reduced the rule set from 2,597 rules to 12.

# Refined parameters
refined_rules <- apriori(transactions, parameter = list(supp = 0.07, conf = 0.75, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5    0.07      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 32 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[22 item(s), 464 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [12 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(sort(refined_rules, by = "lift")[1:10])
##      lhs                                rhs      support    confidence
## [1]  {Honey, Bacon, Onion}           => {Meat}   0.07543103 0.7608696 
## [2]  {Banana, Butter, ShavingFoam}   => {Bacon}  0.07974138 0.8043478 
## [3]  {Banana, Cheese, Butter}        => {Bacon}  0.09051724 0.7924528 
## [4]  {Hazelnut, Butter, ShavingFoam} => {Bacon}  0.08405172 0.7800000 
## [5]  {Cheese, Butter, Egg}           => {Bacon}  0.08189655 0.7755102 
## [6]  {Bread, Butter, ShavingFoam}    => {Bacon}  0.07327586 0.7727273 
## [7]  {Bacon, Butter, Egg}            => {Cheese} 0.08189655 0.7916667 
## [8]  {Carrot, Onion, Butter}         => {Cheese} 0.07112069 0.7857143 
## [9]  {Carrot, ShavingFoam, Olive}    => {Banana} 0.07112069 0.7674419 
## [10] {Bacon, Meat, Butter}           => {Cheese} 0.07758621 0.7500000 
##      coverage   lift     count
## [1]  0.09913793 1.961353 35   
## [2]  0.09913793 1.866087 37   
## [3]  0.11422414 1.838491 42   
## [4]  0.10775862 1.809600 39   
## [5]  0.10560345 1.799184 38   
## [6]  0.09482759 1.792727 34   
## [7]  0.10344828 1.783172 38   
## [8]  0.09051724 1.769764 33   
## [9]  0.09267241 1.711986 33   
## [10] 0.10344828 1.689320 36
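
The jump from 2,597 rules to 12 shows how sensitive the rule count is to the thresholds. One way to choose them more systematically is to sweep a small grid and count the rules at each setting. The sketch below assumes the arules package is installed and uses its bundled `Groceries` dataset as a stand-in for the market data; `count_rules` and the grid values are illustrative choices, not part of the original analysis.

```r
library(arules)

# Stand-in data: Groceries ships with arules (not the market dataset)
data("Groceries")

# Hypothetical helper: number of rules mined at a given support/confidence
count_rules <- function(trans, supp, conf) {
  rules <- apriori(trans,
                   parameter = list(supp = supp, conf = conf, minlen = 2),
                   control = list(verbose = FALSE))
  length(rules)
}

# Illustrative threshold grid (supp varies fastest within each conf)
grid <- expand.grid(supp = c(0.005, 0.01, 0.02), conf = c(0.3, 0.5))
grid$n_rules <- mapply(count_rules, supp = grid$supp, conf = grid$conf,
                       MoreArgs = list(trans = Groceries))
print(grid)
```

For a fixed confidence, raising the support threshold can only shrink the rule set, so the table reads as a trade-off between coverage and the number of rules left to interpret.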

Focused Analysis on Key Products

To derive actionable insights, we conducted a focused analysis on two staple products: Milk and Bread. Neither ranks among the five most frequent items in the transaction summary, but both are everyday staples, so any strong rules involving them would be natural candidates for cross-selling.

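# Milk-related rules (Milk on either side of the rule)
milk_rules <- subset(refined_rules, items %in% "Milk")
# No refined rule contains Milk, so there is nothing to list here
inspect(milk_rules)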

# Bread-related rules
bread_rules <- subset(refined_rules, items %in% "Bread")
inspect(bread_rules)
##     lhs                             rhs     support    confidence coverage  
## [1] {Bread, Butter, ShavingFoam} => {Bacon} 0.07327586 0.7727273  0.09482759
##     lift     count
## [1] 1.792727 34
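
The `items %in%` filter used above matches an item on either side of a rule. When the goal is cross-selling toward one product, it can help to restrict the match to the RHS, or to mine only rules that predict that product. The sketch below shows both options. It assumes the arules package is installed and uses the bundled `Groceries` dataset (item label "whole milk") as a stand-in, with illustrative thresholds that are not those of the analysis.

```r
library(arules)

# Stand-in data: Groceries ships with arules; "whole milk" is one of its labels
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.3, minlen = 2),
                 control = list(verbose = FALSE))

# items matches either side; lhs / rhs restrict the match to one side
milk_any <- subset(rules, items %in% "whole milk")
milk_lhs <- subset(rules, lhs %in% "whole milk")
milk_rhs <- subset(rules, rhs %in% "whole milk")

# Alternative: let apriori generate only rules whose RHS is the target item
milk_target <- apriori(Groceries,
                       parameter  = list(supp = 0.01, conf = 0.3, minlen = 2),
                       appearance = list(rhs = "whole milk", default = "lhs"),
                       control    = list(verbose = FALSE))

cat(length(milk_any), length(milk_lhs), length(milk_rhs), length(milk_target), "\n")
```

Applied to the market data, the same pattern with `rhs = "Milk"` and lower thresholds would show whether any Milk rules exist below the refined cut-offs.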

Summary of Key Product Rules

LHS (Items Bought)              RHS (Item Added)   Support   Confidence   Lift
{Bread, Butter, ShavingFoam}    {Bacon}            0.073     0.773        1.793

Interpretation:

  • Only one refined rule involves Bread: {Bread, Butter, ShavingFoam} → {Bacon}. It is supported by 34 of the 464 transactions (7.3%). About 77% of baskets containing bread, butter, and shaving foam also contain bacon, roughly 1.8 times the overall rate for bacon.
  • No rule involving Milk meets the refined thresholds (support 0.07, confidence 0.75), so this analysis does not support a Milk-specific cross-selling recommendation. Any Milk associations would have to be sought at lower thresholds.

Visualization of Association Rules

Visualization is a powerful tool to understand the complexity of relationships between items. We used several methods to present our findings visually.

# Load arulesViz, which supplies the plot() methods for rules
library(arulesViz)

# Grouped matrix plot of the refined rules
# (`main` is not a supported control parameter for this method, so no title is passed)
plot(refined_rules, method = "grouped")

# Network graph of the refined rules (interactive htmlwidget; `main` is not supported here either)
if (length(refined_rules) > 0) {
  plot(refined_rules, method = "graph", engine = "htmlwidget")
} else {
  cat("No refined rules available to plot.\n")
}

# Focused visualization for Milk
if (length(milk_rules) > 0) {
  plot(milk_rules, method = "graph", engine = "htmlwidget")
} else {
  cat("No Milk-related rules available to plot.\n")
}
## No Milk-related rules available to plot.

# Focused visualization for Bread
if (length(bread_rules) > 0) {
  plot(bread_rules, method = "graph", engine = "htmlwidget")
} else {
  cat("No Bread-related rules available to plot.\n")
}

Visualization Insights:

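Key Findings and Insights

  • Threshold sensitivity: with minimum support 0.05 and minimum confidence 0.6, the algorithm returned 2,597 rules. Raising the thresholds to 0.07 and 0.75 reduced this to 12.
  • Strongest refined rule: {Honey, Bacon, Onion} → {Meat}, with confidence 0.76 and lift 1.96, supported by 35 transactions.
  • Bacon and Cheese dominate the consequents: among the ten refined rules with the highest lift, five predict Bacon and three predict Cheese, and Butter appears in the antecedent of eight of them.
  • Staples: only one refined rule contains Bread, and none contains Milk.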

Business Recommendations

Based on the findings from this analysis, several strategic recommendations can be made to enhance sales and improve customer experience.

Conclusion

This analysis applied association rule mining to 464 grocery transactions and found a small set of strong rules, most of them predicting Bacon or Cheese from item combinations that include Butter. Rules of this kind can inform cross-selling, store layout, and marketing decisions, which illustrates the practical value of association rule mining in retail analytics.

References