Basket analysis using association rules is a key data mining technique for identifying relationships between frequently purchased items. This paper ranks and analyzes the most critical steps in basket analysis, detailing their implementation in R. We discuss rule generation, evaluation metrics, visualization, and their impact on business decision-making. Additionally, we compare Apriori, Eclat, and FP-Growth algorithms, highlighting their advantages and limitations.
Market Basket Analysis (MBA) is a widely used technique in data mining to uncover purchasing patterns by analyzing transaction data. Association rule learning extracts patterns that reveal relationships between items frequently purchased together, which can be used to enhance product placement, cross-selling strategies, and inventory management. This paper emphasizes the steps involved in MBA and their practical application in R, offering an in-depth comparison of the three primary algorithms: Apriori, Eclat, and FP-Growth.
Association rule mining was first introduced by Agrawal et al. (1993) and has since evolved significantly. The three primary algorithms used are:
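Apriori Algorithm: Generates candidate itemsets level by level (breadth-first), discarding any candidate that contains an infrequent subset, and then extracts association rules from the frequent itemsets. The repeated passes over the data make it computationally expensive, but it remains widely used in practice.
Eclat Algorithm: Employs a depth-first search over a vertical data layout, in which each item is stored with the IDs of the transactions that contain it. Support is obtained by intersecting these ID lists, which makes it faster for dense datasets.
FP-Growth Algorithm: Compresses the transactions into a compact prefix tree (the FP-tree) and mines frequent itemsets directly from it. This eliminates the need for candidate generation and makes it efficient for large datasets.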
Recent research (Hahsler & Karpienko, 2017) has shown that FP-Growth outperforms Apriori in scalability, while Eclat is efficient for small, dense datasets. This paper implements these methods and compares their effectiveness.
Definition: Data preparation involves cleaning and structuring transactional data into a suitable format for rule mining.
Importance: The quality of the input data significantly affects the results of association rule mining. Without proper data preparation, insights can be misleading or incomplete.
-Implementation in R:
We used the Groceries dataset, which ships with the arules R package and contains 9,835 transactions over 169 items. Each transaction is the set of items purchased together, stored internally as a sparse binary incidence matrix (1 for purchased, 0 for not purchased). Inspecting the data before rule generation confirms that it is well structured.
**NB:** Although this paper uses the Groceries dataset for demonstration, any dataset that can be converted into a transactional list can be used. For example, a custom CSV file in which each row represents a transaction and each column represents a product must first be transformed into a transactional list, after which the same association rule mining techniques apply.
-Steps to prepare custom transactional data:
-Load the Data: Import the dataset into R.
-Transform the Data: Convert the data into a transactional format.
-Inspect the Data: Ensure the data is formatted correctly before applying association rule mining.
options(repos = c(CRAN = "https://cran.rstudio.com"))
# Install arules only if it is missing; reinstalling a package that is already loaded can fail with a permission error.
if (!requireNamespace("arules", quietly = TRUE)) install.packages("arules")
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
data("Groceries")
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
inspect(head(Groceries))
## items
## [1] {citrus fruit,
## semi-finished bread,
## margarine,
## ready soups}
## [2] {tropical fruit,
## yogurt,
## coffee}
## [3] {whole milk}
## [4] {pip fruit,
## yogurt,
## cream cheese ,
## meat spreads}
## [5] {other vegetables,
## whole milk,
## condensed milk,
## long life bakery product}
## [6] {whole milk,
## butter,
## yogurt,
## rice,
## abrasive cleaner}
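For a custom dataset of the kind described in the note above, the load, transform, and inspect steps might look like the following minimal sketch. The data frame `basket_df` and its product columns are hypothetical; the conversion relies on the arules coercion from a logical matrix to the `transactions` class.

```r
library(arules)

# Hypothetical one-hot purchase table: one row per transaction, one column per product.
basket_df <- data.frame(
  milk   = c(1, 0, 1, 1),
  bread  = c(1, 1, 0, 1),
  butter = c(0, 1, 1, 1)
)

# Turn the 0/1 table into a logical matrix, then coerce it to a transactions object.
basket_trans <- as(as.matrix(basket_df) == 1, "transactions")

# Inspect the result before mining rules.
summary(basket_trans)
inspect(basket_trans)
```

If the CSV instead lists the purchased item names on each line, `read.transactions()` with `format = "basket"` and `sep = ","` is likely the more direct route.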
-Definition: Rule generation involves identifying frequent itemsets and extracting meaningful associations.
-Algorithms:
-Apriori: Uses an iterative process to generate frequent itemsets. This approach is effective but computationally expensive for large datasets.
-Eclat: Utilizes a depth-first search strategy, making it faster for dense datasets.
-FP-Growth: Uses a tree-based approach, eliminating the need for candidate generation, which enhances its performance for large datasets.
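To make the difference between these search strategies more concrete, the sketch below works through a toy example in base R. It is an illustration only, not how arules implements the algorithms: the transactions and the `min_count` threshold are invented, candidate pairs are formed only from frequent single items (the Apriori pruning idea), and pair support is counted by intersecting transaction-ID lists (the Eclat idea).

```r
# Toy transactions (hypothetical data, not taken from Groceries).
transactions <- list(
  c("milk", "bread"),
  c("milk", "butter"),
  c("milk", "bread", "butter"),
  c("bread")
)
min_count <- 2  # absolute minimum support count (assumed for this sketch)

# Vertical layout: for each item, the IDs of the transactions that contain it.
items <- sort(unique(unlist(transactions)))
tid_lists <- lapply(items, function(item) {
  which(vapply(transactions, function(tr) item %in% tr, logical(1)))
})
names(tid_lists) <- items

# Frequent single items are those whose ID list is long enough.
frequent_items <- items[lengths(tid_lists) >= min_count]

# Candidate pairs come only from frequent items; the support count of a pair
# is the size of the intersection of the two ID lists.
pairs <- combn(frequent_items, 2, simplify = FALSE)
pair_counts <- vapply(pairs, function(p) {
  length(intersect(tid_lists[[p[1]]], tid_lists[[p[2]]]))
}, integer(1))
names(pair_counts) <- vapply(pairs, paste, character(1), collapse = ", ")

frequent_pairs <- pair_counts[pair_counts >= min_count]
print(frequent_pairs)
```

Here the pair bread and butter is dropped because only one transaction contains both, while the two pairs that include milk each reach the threshold.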
-Implementation in R:
Apriori Algorithm:
rules_apriori <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(head(rules_apriori))
## lhs rhs support
## [1] {curd, yogurt} => {whole milk} 0.01006609
## [2] {other vegetables, butter} => {whole milk} 0.01148958
## [3] {other vegetables, domestic eggs} => {whole milk} 0.01230300
## [4] {yogurt, whipped/sour cream} => {whole milk} 0.01087951
## [5] {other vegetables, whipped/sour cream} => {whole milk} 0.01464159
## [6] {pip fruit, other vegetables} => {whole milk} 0.01352313
## confidence coverage lift count
## [1] 0.5823529 0.01728521 2.279125 99
## [2] 0.5736041 0.02003050 2.244885 113
## [3] 0.5525114 0.02226741 2.162336 121
## [4] 0.5245098 0.02074225 2.052747 107
## [5] 0.5070423 0.02887646 1.984385 144
## [6] 0.5175097 0.02613116 2.025351 133
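Eclat Algorithm (eclat() returns frequent itemsets rather than rules, so its output reports support and count but no confidence or lift):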
rules_eclat <- eclat(Groceries, parameter = list(supp = 0.01, maxlen = 5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 98
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating sparse bit matrix ... [88 row(s), 9835 column(s)] done [0.00s].
## writing ... [333 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
inspect(head(rules_eclat))
## items support count
## [1] {whole milk, hard cheese} 0.01006609 99
## [2] {whole milk, butter milk} 0.01159126 114
## [3] {other vegetables, butter milk} 0.01037112 102
## [4] {ham, whole milk} 0.01148958 113
## [5] {whole milk, sliced cheese} 0.01077783 106
## [6] {whole milk, oil} 0.01128622 111
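Since the Eclat output above contains itemsets only, rules have to be derived in a second step. One possible way, sketched here under the assumption that the same 0.5 confidence threshold as in the Apriori call is wanted, is the arules function `ruleInduction()`; the object names are illustrative.

```r
library(arules)
data("Groceries")

# Mine frequent itemsets with Eclat (no rules yet).
itemsets_eclat <- eclat(Groceries,
                        parameter = list(supp = 0.01, maxlen = 5),
                        control = list(verbose = FALSE))

# Derive rules from the itemsets; confidence = 0.5 mirrors the Apriori call in this paper.
rules_from_eclat <- ruleInduction(itemsets_eclat, Groceries, confidence = 0.5)

# Show the rules with the highest lift.
inspect(head(sort(rules_from_eclat, by = "lift")))
```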
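FP-Growth:
The arules functions used in this paper do not run FP-Growth itself, so the call below uses apriori() with target = "frequent itemsets" as a stand-in. At the same support threshold it returns the same frequent itemsets an FP-Growth miner would, but it does not reproduce FP-Growth's running time or memory behaviour. The conf argument is omitted because confidence is not used when mining itemsets.
fpgrowth_rules <- apriori(Groceries, parameter = list(supp = 0.01, target = "frequent itemsets"))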
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [333 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(head(fpgrowth_rules))
## items support count
## [1] {liquor} 0.01108287 109
## [2] {condensed milk} 0.01026945 101
## [3] {flower (seeds)} 0.01037112 102
## [4] {spread cheese} 0.01118454 110
## [5] {dish cleaner} 0.01047280 103
## [6] {cling film/bags} 0.01138790 112
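The frequent itemsets at a given support threshold are determined by the data rather than by the mining algorithm, so Apriori, Eclat, and FP-Growth are expected to agree on the result and to differ only in running time and memory use. A minimal check of this for the two miners used in this paper could look as follows; the object names are illustrative.

```r
library(arules)
data("Groceries")

# Frequent itemsets from Apriori and from Eclat at the same support threshold.
sets_apriori <- apriori(Groceries,
                        parameter = list(supp = 0.01, target = "frequent itemsets"),
                        control = list(verbose = FALSE))
sets_eclat <- eclat(Groceries,
                    parameter = list(supp = 0.01),
                    control = list(verbose = FALSE))

# Compare the number of itemsets and their printed labels.
length(sets_apriori)
length(sets_eclat)
setequal(labels(sets_apriori), labels(sets_eclat))
```

Both runs report 333 itemsets in the outputs shown in this paper, which is consistent with this expectation.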
-Definition: Rule evaluation involves assessing the relevance and quality of generated rules using various metrics.
-Importance: Filtering and ranking rules ensures that only meaningful, actionable patterns are retained.
-Key Metrics:
-**Support:** Measures the frequency of an itemset in the dataset.
High support indicates that the itemset appears often in transactions.
support(A=>B) = (number of transactions containing both A and B) / (total number of transactions)
-**Confidence:** Determines the reliability of the rule.
Estimates the likelihood that B is bought when A is bought.
confidence(A=>B) = (number of transactions containing both A and B) / (number of transactions containing A)
-**Lift:** Evaluates rule strength relative to random co-occurrence.
Measures how much more often A and B occur together than would be expected if they were independent; a lift above 1 indicates a positive association.
lift(A=>B) = confidence(A=>B) / support(B)
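As a worked example, the three formulas can be applied by hand to the rule {curd, yogurt} => {whole milk}. The counts are taken from the outputs in this paper, except for the number of transactions containing curd and yogurt, which is an assumption derived here from the reported coverage (0.01728521 × 9835, which rounds to 170).

```r
# Counts for the rule {curd, yogurt} => {whole milk} in Groceries.
n_transactions <- 9835  # total transactions, from summary(Groceries)
count_ab <- 99          # transactions with curd, yogurt, and whole milk (rule count)
count_a  <- 170         # transactions with curd and yogurt (assumed: coverage * n, rounded)
count_b  <- 2513        # transactions with whole milk, from summary(Groceries)

support_ab    <- count_ab / n_transactions
confidence_ab <- count_ab / count_a
lift_ab       <- confidence_ab / (count_b / n_transactions)

round(c(support = support_ab, confidence = confidence_ab, lift = lift_ab), 4)
```

The results (support about 0.0101, confidence about 0.5824, lift about 2.2791) agree with the first row of the Apriori output.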
-Implementation in R:
apriori() computes support, confidence, coverage, lift, and count while mining the rules; quality() returns these metrics as a data frame so that the rules can be assessed:
quality(rules_apriori)
## support confidence coverage lift count
## 1 0.01006609 0.5823529 0.01728521 2.279125 99
## 2 0.01148958 0.5736041 0.02003050 2.244885 113
## 3 0.01230300 0.5525114 0.02226741 2.162336 121
## 4 0.01087951 0.5245098 0.02074225 2.052747 107
## 5 0.01464159 0.5070423 0.02887646 1.984385 144
## 6 0.01352313 0.5175097 0.02613116 2.025351 133
## 7 0.01037112 0.5862069 0.01769192 3.029608 102
## 8 0.01230300 0.5845411 0.02104728 3.020999 121
## 9 0.01199797 0.5700483 0.02104728 2.230969 118
## 10 0.01514997 0.5173611 0.02928317 2.024770 149
## 11 0.01291307 0.5000000 0.02582613 2.584078 127
## 12 0.01453991 0.5629921 0.02582613 2.203354 143
## 13 0.01220132 0.5020921 0.02430097 2.594890 120
## 14 0.01270971 0.5230126 0.02430097 2.046888 125
## 15 0.02226741 0.5128806 0.04341637 2.007235 219
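A table like the one above is easier to act on once the rules are ranked and filtered. The following sketch sorts the rules by lift and keeps those above a cut-off; the threshold of 2.5 is an assumption chosen for illustration, not a recommendation.

```r
library(arules)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
                 control = list(verbose = FALSE))

# Rank rules by lift, strongest first.
rules_by_lift <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_by_lift, n = 3))

# Keep only rules above an illustrative lift cut-off.
strong_rules <- subset(rules, subset = lift > 2.5)
inspect(strong_rules)
```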
-Definition: Visualization allows for an intuitive understanding of association rules and patterns, enabling better business decisions.
-Importance: Visual tools help stakeholders (e.g., marketing teams, product managers) interpret complex patterns and translate them into actionable business strategies.
-Implementation in R:
Apriori Visualization:
library(arulesViz)
# The graph method in this arulesViz version has no `type` control parameter, so it is omitted.
plot(rules_apriori, method = "graph")
Eclat Visualization:
plot(rules_eclat, method="paracoord", control=list(reorder=TRUE))
FP-Growth Visualization:
# Plot only the 100 itemsets with the highest support; the unsupported `type` control parameter is omitted.
plot(head(sort(fpgrowth_rules, by = "support"), n = 100), method = "graph")
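Besides graph layouts, a scatter plot of support against confidence with lift as the shading is a common first look at a rule set. A minimal sketch with arulesViz, assuming the same thresholds as the Apriori call in this paper:

```r
library(arules)
library(arulesViz)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
                 control = list(verbose = FALSE))

# Scatter plot: x = support, y = confidence, colour = lift.
plot(rules, method = "scatterplot",
     measure = c("support", "confidence"), shading = "lift")
```

Rules in the upper part of the plot with dark shading combine high confidence with high lift and are usually the first candidates for review.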
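Consider a supermarket that finds the rule {curd, yogurt} => {whole milk}, which in the Groceries results has a confidence of about 0.58 and a lift of about 2.28.
-Actionable Insights:
- Bundle curd and yogurt with whole milk in promotional offers.
- Place the three products close to one another so that customers buying curd and yogurt also see whole milk.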
- Offer discounts on whole milk when purchasing curd and yogurt.
These insights can lead to increased sales and improved customer satisfaction by strategically placing related products together and offering targeted promotions.
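When a promotion targets a specific product, the relevant rules can be pulled out of the full rule set directly. The sketch below selects rules whose left-hand side contains both curd and yogurt and whose right-hand side is whole milk, using the arules operators `%ain%` (all items present) and `%in%`; the object names are illustrative.

```r
library(arules)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
                 control = list(verbose = FALSE))

# Rules with curd AND yogurt on the left-hand side and whole milk on the right-hand side.
milk_rules <- subset(rules,
                     subset = lhs %ain% c("curd", "yogurt") & rhs %in% "whole milk")
inspect(milk_rules)
```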
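Rule Redundancy: Many generated rules are redundant or uninformative, so they should be filtered using lift and confidence thresholds.
Scalability: Mining time and memory use grow quickly as the support threshold is lowered; FP-Growth is recommended for handling large datasets.
Interpreting Rules: A strong rule indicates co-occurrence, not causation, so businesses must apply domain knowledge before implementing strategies.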
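For the redundancy issue, arules offers `is.redundant()`, which flags a rule when a more general rule with the same right-hand side performs at least as well on a chosen measure (confidence by default). A minimal sketch of pruning with it, assuming the thresholds used earlier in this paper:

```r
library(arules)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5, minlen = 2),
                 control = list(verbose = FALSE))

# Flag redundant rules and keep the rest.
redundant <- is.redundant(rules)
pruned_rules <- rules[!redundant]

# Compare the rule counts before and after pruning.
length(rules)
length(pruned_rules)
```

With a rule set this small the pruning may remove little or nothing; the step tends to matter more at lower support and confidence thresholds, where thousands of rules are generated.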
This paper ranked the steps involved in basket analysis, emphasizing data preparation as the foundation for effective rule mining. By comparing the Apriori, Eclat, and FP-Growth algorithms, we highlighted their respective advantages in different contexts. Visualization and interpretation of rules allow businesses to make informed decisions and optimize operations. Basket analysis, when properly implemented, can lead to valuable insights and strategic business advantages.