What are the most important steps in basket analysis with association rules?

Abstract

Basket analysis using association rules is a key data mining technique for identifying relationships between frequently purchased items. This paper ranks and analyzes the most critical steps in basket analysis, detailing their implementation in R. We discuss rule generation, evaluation metrics, visualization, and their impact on business decision-making. Additionally, we compare Apriori, Eclat, and FP-Growth algorithms, highlighting their advantages and limitations.

Introduction

Market Basket Analysis (MBA) is a widely used technique in data mining to uncover purchasing patterns by analyzing transaction data. Association rule learning extracts patterns that reveal relationships between items frequently purchased together, which can be used to enhance product placement, cross-selling strategies, and inventory management. This paper emphasizes the steps involved in MBA and their practical application in R, offering an in-depth comparison of the three primary algorithms: Apriori, Eclat, and FP-Growth.

Literature Review

Association rule mining was introduced by Agrawal et al. (1993) and has since evolved significantly. The three primary algorithms are:

Apriori Algorithm: Generates candidate itemsets level by level (breadth-first) and prunes any candidate that has an infrequent subset. The repeated passes over the data make it computationally expensive, but it is widely used in practice.
Eclat Algorithm: Stores a transaction-ID list for each item and searches depth-first, computing support by intersecting those lists, which makes it faster for dense datasets.
FP-Growth Algorithm: Compresses the transactions into a compact prefix tree (FP-tree) and mines it directly, without candidate generation, which makes it efficient for large datasets.

The data mining literature (e.g., Han et al., 2012) generally reports that FP-Growth scales better than Apriori, while Eclat is efficient for small, dense datasets. This paper runs Apriori and Eclat in R with the arules package and approximates the FP-Growth result by mining frequent itemsets with apriori().

Ranking the Most Important Steps in Basket Analysis

1. Data Preparation (Most important)

Definition: Data preparation involves cleaning and structuring transactional data into a suitable format for rule mining.
Importance: The quality of the input data significantly affects the results of association rule mining. Without proper data preparation, insights can be misleading or incomplete.

-Implementation in R:

We used the Groceries dataset, which ships with the arules R package. It contains 9,835 transactions over 169 items, stored as a sparse binary incidence matrix (1 if an item was purchased in a transaction, 0 otherwise), so each transaction can be read as a list of purchased items. Inspecting the data before rule generation confirms that it is well structured.

NB: Although this paper uses the Groceries dataset for demonstration, any dataset that can be converted into a transactional format will work. For example, a CSV file in which each row represents a transaction and each column represents a product can be converted into a transactions object, which makes association rule mining applicable to a wide variety of datasets.

-Steps to prepare custom transactional data:

Load the Data: Import the dataset into R.
Transform the Data: Convert the data into a transactional format.
Inspect the Data: Ensure the data is formatted correctly before applying association rule mining.
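The three steps can be sketched as follows. This is a minimal illustration, not the paper's own pipeline: the purchases table, its product names, and the temporary CSV file are all made up for the example, and the data is assumed to be in the wide 0/1 layout described in the note above.

```r
library(arules)

# Hypothetical wide-format purchase table (made-up products and values):
# one row per transaction, one 0/1 column per product.
purchases <- data.frame(
  milk   = c(1, 0, 1, 1),
  bread  = c(1, 1, 0, 1),
  butter = c(0, 1, 0, 1)
)

# Write it to a temporary CSV so the "load" step can be shown end to end.
csv_path <- tempfile(fileext = ".csv")
write.csv(purchases, csv_path, row.names = FALSE)

# Step 1, load the data.
raw <- read.csv(csv_path)

# Step 2, transform: a logical matrix coerces directly to 'transactions'.
custom_trans <- as(as.matrix(raw) == 1, "transactions")

# Step 3, inspect the result before mining.
summary(custom_trans)
inspect(custom_trans)
```

Data stored with one basket per line (items separated by commas) would instead be read with read.transactions() and format = "basket".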

options(repos = c(CRAN = "https://cran.rstudio.com"))

install.packages("arules")

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

data("Groceries")
summary(Groceries)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

inspect(head(Groceries))

##     items                     
## [1] {citrus fruit,            
##      semi-finished bread,     
##      margarine,               
##      ready soups}             
## [2] {tropical fruit,          
##      yogurt,                  
##      coffee}                  
## [3] {whole milk}              
## [4] {pip fruit,               
##      yogurt,                  
##      cream cheese ,           
##      meat spreads}            
## [5] {other vegetables,        
##      whole milk,              
##      condensed milk,          
##      long life bakery product}
## [6] {whole milk,              
##      butter,                  
##      yogurt,                  
##      rice,                    
##      abrasive cleaner}
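As a further inspection step, single-item frequencies can be checked before choosing a support threshold. The sketch below uses itemFrequency() from arules; the variable names are illustrative only, and the 1% cut-off is simply the threshold used later in this paper.

```r
library(arules)
data("Groceries")

# Relative frequency (support) of every single item.
item_support <- itemFrequency(Groceries, type = "relative")

# Absolute purchase counts for the same items.
item_counts <- itemFrequency(Groceries, type = "absolute")

# Five most frequent items; these should agree with the summary() output.
head(sort(item_counts, decreasing = TRUE), 5)

# How many items clear a 1% support threshold on their own.
sum(item_support >= 0.01)
```

Items that fall below the threshold on their own cannot appear in any frequent itemset, so this check gives a quick sense of how aggressive a given threshold is.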

2. Rule Generation

-Definition: Rule generation involves identifying frequent itemsets and extracting meaningful associations.

-Algorithms:

-Apriori: Uses an iterative process to generate frequent itemsets. This approach is effective but computationally expensive for large datasets.

-Eclat: Utilizes a depth-first search strategy, making it faster for dense datasets.

-FP-Growth: Uses a tree-based approach, eliminating the need for candidate generation, which enhances its performance for large datasets.

-Implementation in R:

Apriori Algorithm:

rules_apriori <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(head(rules_apriori))

##     lhs                                       rhs          support   
## [1] {curd, yogurt}                         => {whole milk} 0.01006609
## [2] {other vegetables, butter}             => {whole milk} 0.01148958
## [3] {other vegetables, domestic eggs}      => {whole milk} 0.01230300
## [4] {yogurt, whipped/sour cream}           => {whole milk} 0.01087951
## [5] {other vegetables, whipped/sour cream} => {whole milk} 0.01464159
## [6] {pip fruit, other vegetables}          => {whole milk} 0.01352313
##     confidence coverage   lift     count
## [1] 0.5823529  0.01728521 2.279125  99  
## [2] 0.5736041  0.02003050 2.244885 113  
## [3] 0.5525114  0.02226741 2.162336 121  
## [4] 0.5245098  0.02074225 2.052747 107  
## [5] 0.5070423  0.02887646 1.984385 144  
## [6] 0.5175097  0.02613116 2.025351 133

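Eclat Algorithm:

Note that eclat() mines frequent itemsets rather than rules, so the object below (named rules_eclat for consistency with the other sections) holds itemsets with support and count only.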

rules_eclat <- eclat(Groceries, parameter = list(supp = 0.01, maxlen = 5))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.01      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 98 
## 
## create itemset ... 
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating sparse bit matrix ... [88 row(s), 9835 column(s)] done [0.00s].
## writing  ... [333 set(s)] done [0.01s].
## Creating S4 object  ... done [0.00s].

inspect(head(rules_eclat))

##     items                           support    count
## [1] {whole milk, hard cheese}       0.01006609  99  
## [2] {whole milk, butter milk}       0.01159126 114  
## [3] {other vegetables, butter milk} 0.01037112 102  
## [4] {ham, whole milk}               0.01148958 113  
## [5] {whole milk, sliced cheese}     0.01077783 106  
## [6] {whole milk, oil}               0.01128622 111
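Because Eclat yields itemsets, a separate step is needed to turn them into rules. One way to do this, sketched here under the assumption that the same 0.5 confidence threshold as the Apriori run is wanted, is ruleInduction() from arules. The object names itemsets_eclat and rules_from_eclat are introduced only for this example.

```r
library(arules)
data("Groceries")

# Mine frequent itemsets with Eclat (same thresholds as above, output silenced).
itemsets_eclat <- eclat(Groceries,
                        parameter = list(supp = 0.01, maxlen = 5),
                        control = list(verbose = FALSE))

# Derive rules from the itemsets, keeping those with confidence >= 0.5.
rules_from_eclat <- ruleInduction(itemsets_eclat, Groceries, confidence = 0.5)

# Show the most confident rules.
inspect(head(sort(rules_from_eclat, by = "confidence")))
```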

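FP-Growth:

Note: arules natively implements only Apriori and Eclat, so the call below uses apriori() with target = "frequent itemsets" as a stand-in for FP-Growth. Because every exact miner returns the same frequent itemsets at a given support threshold, the result matches what FP-Growth would produce, but the FP-tree algorithm itself is not exercised. The conf argument is dropped because confidence does not apply to itemsets (the parameter table reports it as NA).

fpgrowth_rules <- apriori(Groceries, parameter = list(supp = 0.01, target = "frequent itemsets"))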

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen            target  ext
##      10 frequent itemsets TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [333 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(head(fpgrowth_rules))

##     items             support    count
## [1] {liquor}          0.01108287 109  
## [2] {condensed milk}  0.01026945 101  
## [3] {flower (seeds)}  0.01037112 102  
## [4] {spread cheese}   0.01118454 110  
## [5] {dish cleaner}    0.01047280 103  
## [6] {cling film/bags} 0.01138790 112
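The runs above report 333 itemsets from both eclat() and apriori(). The sketch below checks that the two miners agree on the itemsets themselves, not just on the count, which is what one would expect from any exact frequent-itemset algorithm at the same support threshold. The names sets_apriori and sets_eclat are illustrative.

```r
library(arules)
data("Groceries")

# Frequent itemsets from Apriori (level-wise search).
sets_apriori <- apriori(Groceries,
                        parameter = list(supp = 0.01, target = "frequent itemsets"),
                        control = list(verbose = FALSE))

# Frequent itemsets from Eclat (depth-first search over ID lists).
sets_eclat <- eclat(Groceries,
                    parameter = list(supp = 0.01),
                    control = list(verbose = FALSE))

# Compare the counts and the itemset labels.
c(apriori = length(sets_apriori), eclat = length(sets_eclat))
same_sets <- setequal(labels(sets_apriori), labels(sets_eclat))
same_sets
```

The algorithms therefore differ in run time and memory use rather than in what they find, so the choice between them is a performance decision.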

3. Rule Evaluation

-Definition: Rule evaluation involves assessing the relevance and quality of generated rules using various metrics.

-Importance: Filtering and ranking rules ensures that only meaningful, actionable patterns are retained.

-Key Metrics:

-Support: Measures how frequently an itemset appears in the dataset. High support indicates that the itemset occurs in many transactions.

support(A=>B) = (number of transactions containing both A and B) / (total number of transactions)

-Confidence: Measures the reliability of the rule, that is, the likelihood that B is bought when A is bought.

confidence(A=>B) = (number of transactions containing both A and B) / (number of transactions containing A)

-Lift: Measures how much more often A and B occur together than would be expected if they were independent. A lift above 1 indicates a positive association, a lift of 1 indicates independence, and a lift below 1 indicates a negative association.

lift(A=>B) = confidence(A=>B) / support(B)
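To make the formulas concrete, the three metrics can be computed by hand for the rule {curd, yogurt} => {whole milk} and compared with the values reported by apriori(). This is a sketch using the %ain% (all items present) and %in% operators that arules provides for transactions; the variable names are chosen for the example.

```r
library(arules)
data("Groceries")

n_total <- length(Groceries)                    # total number of transactions

has_lhs <- Groceries %ain% c("curd", "yogurt")  # baskets with curd AND yogurt
has_rhs <- Groceries %in% "whole milk"          # baskets with whole milk

n_lhs  <- sum(has_lhs)                          # transactions containing A
n_both <- sum(has_lhs & has_rhs)                # transactions containing A and B

rule_support    <- n_both / n_total
rule_confidence <- n_both / n_lhs
rule_lift       <- rule_confidence / (sum(has_rhs) / n_total)

round(c(support = rule_support,
        confidence = rule_confidence,
        lift = rule_lift), 4)
```

The counts work out to 99 baskets containing all three items out of 170 containing curd and yogurt, which reproduces the support of 0.0101, confidence of 0.5824, and lift of 2.2791 shown for this rule in the Apriori output.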

-Implementation in R:

After generating rules, quality() returns the support, confidence, coverage, lift, and count that apriori() stored with each rule:

quality(rules_apriori)

##       support confidence   coverage     lift count
## 1  0.01006609  0.5823529 0.01728521 2.279125    99
## 2  0.01148958  0.5736041 0.02003050 2.244885   113
## 3  0.01230300  0.5525114 0.02226741 2.162336   121
## 4  0.01087951  0.5245098 0.02074225 2.052747   107
## 5  0.01464159  0.5070423 0.02887646 1.984385   144
## 6  0.01352313  0.5175097 0.02613116 2.025351   133
## 7  0.01037112  0.5862069 0.01769192 3.029608   102
## 8  0.01230300  0.5845411 0.02104728 3.020999   121
## 9  0.01199797  0.5700483 0.02104728 2.230969   118
## 10 0.01514997  0.5173611 0.02928317 2.024770   149
## 11 0.01291307  0.5000000 0.02582613 2.584078   127
## 12 0.01453991  0.5629921 0.02582613 2.203354   143
## 13 0.01220132  0.5020921 0.02430097 2.594890   120
## 14 0.01270971  0.5230126 0.02430097 2.046888   125
## 15 0.02226741  0.5128806 0.04341637 2.007235   219
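Filtering and ranking can then be done directly on these measures. The sketch below regenerates the rules so it runs on its own, sorts them by lift, and keeps only rules with lift above 3. That cut-off is an arbitrary value chosen for illustration, not a recommended threshold.

```r
library(arules)
data("Groceries")

rules_apriori <- apriori(Groceries,
                         parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
                         control = list(verbose = FALSE))

# Rank: strongest associations first.
rules_by_lift <- sort(rules_apriori, by = "lift")
inspect(head(rules_by_lift, 3))

# Filter: keep only rules whose lift exceeds the illustrative cut-off.
strong_rules <- subset(rules_apriori, lift > 3)
inspect(strong_rules)
```

With the quality table above, two of the 15 rules pass this filter (lift 3.03 and 3.02).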

4. Rules Visualization and Interpretation

-Definition: Visualization allows for an intuitive understanding of association rules and patterns, enabling better business decisions.

-Importance: Visual tools help stakeholders (e.g., marketing teams, product managers) interpret complex patterns and translate them into actionable business strategies.

-Implementation in R:

Apriori Visualization:

library(arulesViz)
plot(rules_apriori, method = "graph")

Eclat Visualization:

plot(rules_eclat, method="paracoord", control=list(reorder=TRUE))

FP-Growth Visualization:

plot(fpgrowth_rules, method = "graph")

## Warning: Too many itemsets supplied. Only plotting the best 100 using 'support'
## (change control parameter max if needed).

## Warning: ggrepel: 16 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
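Overplotting warnings like the ones above usually mean that too many rules or itemsets were passed to the plot. A possible alternative, sketched here with standard arulesViz plot methods, is to show all rules in a scatterplot of support against confidence and to reserve the graph view for a handful of top rules. The choice of five rules is arbitrary.

```r
library(arules)
library(arulesViz)
data("Groceries")

rules_apriori <- apriori(Groceries,
                         parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
                         control = list(verbose = FALSE))

# Overview: every rule as a point, shaded by lift.
plot(rules_apriori, method = "scatterplot",
     measure = c("support", "confidence"), shading = "lift")

# Detail: graph view restricted to the five rules with the highest lift.
top_rules <- head(sort(rules_apriori, by = "lift"), 5)
plot(top_rules, method = "graph")
```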

Business Use Case: Retail Optimization

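Consider a supermarket that finds the rule {curd, yogurt} => {whole milk}. In the Groceries results above, this rule has a confidence of 0.58 and a lift of 2.28: about 58% of baskets containing curd and yogurt also contain whole milk, roughly 2.3 times the rate expected if the purchases were independent.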

-Actionable Insights:

- Bundle yogurt and curd together with milk in promotional offers.

- Optimize shelf placement to increase sales.

- Offer discounts on whole milk when purchasing curd and yogurt.

These insights can lead to increased sales and improved customer satisfaction by strategically placing related products together and offering targeted promotions.
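To find the rules behind a specific product decision, the rule set can be queried by its left-hand and right-hand sides. This is a sketch using subset() with the arules item-matching operators; the product chosen and the object names milk_rules and target_rule are for illustration.

```r
library(arules)
data("Groceries")

rules_apriori <- apriori(Groceries,
                         parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
                         control = list(verbose = FALSE))

# All rules that predict a whole milk purchase.
milk_rules <- subset(rules_apriori, rhs %in% "whole milk")
inspect(head(sort(milk_rules, by = "confidence"), 3))

# Narrow down to rules whose antecedent contains both curd and yogurt.
target_rule <- subset(milk_rules, lhs %ain% c("curd", "yogurt"))
inspect(target_rule)
```

A marketing team could run the same query for any product it wants to promote and use the resulting antecedents as candidates for bundling or shelf placement.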

Challenges in Basket Analysis

Rule Redundancy: Mining often produces many overlapping or uninformative rules; lift and confidence thresholds help filter them out.
Scalability: FP-Growth is recommended for handling large datasets.
Interpreting Rules: Rules describe co-occurrence, not causation, so businesses must apply domain knowledge before acting on them.
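For the redundancy challenge, arules offers is.redundant(), which flags a rule when a more general rule with the same consequent has equal or higher confidence. The sketch below uses lower thresholds than the rest of the paper (an assumption made only so that enough rules are generated for pruning to matter).

```r
library(arules)
data("Groceries")

# Lower thresholds than in the main analysis, chosen for illustration only.
many_rules <- apriori(Groceries,
                      parameter = list(supp = 0.005, conf = 0.3, minlen = 2),
                      control = list(verbose = FALSE))

# Flag rules that add nothing over a more general rule, then drop them.
redundant_flag <- is.redundant(many_rules)
pruned_rules <- many_rules[!redundant_flag]

c(before = length(many_rules), after = length(pruned_rules))
```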

Conclusion

This paper ranked the steps involved in basket analysis, emphasizing data preparation as the foundation for effective rule mining. By comparing the Apriori, Eclat, and FP-Growth algorithms, we highlighted their respective advantages in different contexts. Visualization and interpretation of rules allow businesses to make informed decisions and optimize operations. Basket analysis, when properly implemented, can lead to valuable insights and strategic business advantages.

References

Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. ACM SIGMOD Record, 22(2), 207-216.
Hahsler, M., & Karpienko, R. (2017). Visualizing association rules in hierarchical groups. Journal of Business Economics, 87(3), 317-335.
Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and Techniques. Elsevier.
Tan, P. N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining. Addison-Wesley.
R Documentation for arules: https://cran.r-project.org/web/packages/arules/arules.pdf

John Dalton Julio Fils Sinord

2025-02-12
