Preliminary

Association rule mining is a widely used data mining technique aimed at discovering interesting relationships between items in datasets. It goes by various names such as frequent itemset analysis, association analysis, or association rule learning.

In R, performing association rule mining is facilitated by the arules package, which offers comprehensive tools for generating association rules from transaction datasets. Additionally, the arulesViz package complements this process by providing visualization capabilities to explore and interpret the discovered association rules.

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)

Association analysis typically asks how often certain sets of items, usually products, appear together in customer transactions, commonly referred to as baskets. It is widely used to study purchasing behavior in supermarkets and retail chains, where point-of-sale scanners record each transaction. These records are typically structured as tuples of a transaction ID and the corresponding item IDs. By identifying frequent itemsets, retailers can see which items are commonly purchased together and devise strategies to increase sales.

To illustrate the fundamental concepts of association rule mining, we will work with a set of baskets representing transactions in a grocery store. This is a simplified dataset, so the conclusions drawn from it may differ from those obtained in real-world scenarios. The dataset will be stored as a set of transactions in an object named “transactions”.

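First, import the transaction data into an object of class “transactions” so that it can be analyzed. The read.transactions() function from the arules package does this; here each line of the file is one basket, items are separated by commas, and rm.duplicates = TRUE drops items that are repeated within a basket.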

transactions <- read.transactions("groceries - groceries.csv", sep = ",", rm.duplicates = TRUE)
## Warning in readLines(file, encoding = encoding): incomplete final line found on
## 'groceries - groceries.csv'
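
As a minimal sketch of what a “transactions” object holds, the same class can also be built from an in-memory list of baskets. The baskets below are hypothetical and invented for illustration; they are not taken from the CSV file.

```r
library(arules)

# Hypothetical baskets, invented for illustration (not from the CSV file)
baskets <- list(
  c("whole milk", "rolls/buns"),
  c("whole milk", "other vegetables", "soda"),
  c("rolls/buns", "soda"),
  c("whole milk", "other vegetables")
)

# Coerce the list into the sparse "transactions" representation used by arules
toy <- as(baskets, "transactions")

dim(toy)                                # number of transactions, number of distinct items
itemFrequency(toy, type = "absolute")   # how many baskets contain each item
```

Each list element becomes one row (transaction) and each distinct item label becomes one column, which is the same layout that read.transactions() produces from a file.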

Inspect the dimensions of this object:

dim(transactions)
## [1] 9836  231

This means we have 9836 transactions and 231 distinct items.

summary(transactions)
## transactions as itemMatrix in sparse format with
##  9836 rows (elements/itemsets/transactions) and
##  231 columns (items) and a density of 0.0234297 
## 
## most frequent items:
##       whole milk                1 other vegetables       rolls/buns 
##             2513             2159             1903             1809 
##             soda          (Other) 
##             1715            43136 
## 
## element (itemset/transaction) length distribution:
## sizes
##    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   18   19   20   21   22   23   24   25   27   28   29   30   33 
##   29   14   14    9   11    4    6    1    1    1    1    3    2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   5.412   7.000  33.000 
## 
## includes extended item information - examples:
##   labels
## 1      1
## 2     10
## 3     11

The transaction data is stored as an itemMatrix in sparse format with 9836 rows (transactions) and 231 columns (items). The density of about 0.0234 means that only about 2.3% of the cells in this matrix are non-zero, that is, a typical transaction contains only a handful of the 231 items.

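Among the items recorded in the transactions, the most frequent product is “whole milk” (2513 occurrences), followed by “other vegetables” (1903), “rolls/buns” (1809), and “soda” (1715). The second most frequent entry overall, with 2159 occurrences, carries the label “1”. Numeric labels like this one also appear in the extended item information (1, 10, 11) and look like numbers rather than product names, which suggests that a numeric column of the CSV file may have been read as items.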

Examining the distribution of transaction lengths, we find that the majority of transactions contain 2 to 5 items, with the most common length being 2. However, there are also instances of longer transactions, with lengths ranging up to 33 items. The mean transaction length is approximately 5.41 items, indicating a tendency towards relatively small transaction sizes.
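
The figures in the summary can be cross-checked with a few lines of base R. The sketch below simply retypes the length distribution printed by summary(transactions) and recomputes the number of transactions, the mean length, the share of transactions with 2 to 5 items, and the density; the variable names are illustrative.

```r
# Length distribution copied from the summary(transactions) output above
sizes  <- c(2:25, 27:30, 33)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350, 246, 182, 117,
            78, 77, 55, 46, 29, 14, 14, 9, 11, 4, 6, 1, 1, 1, 1, 3, 2)

n_transactions <- sum(counts)            # total number of transactions
n_entries      <- sum(sizes * counts)    # total number of non-zero cells

mean_length <- n_entries / n_transactions           # mean transaction length
share_2_to_5 <- sum(counts[sizes <= 5]) / n_transactions  # share with 2 to 5 items
density <- n_entries / (n_transactions * 231)       # non-zero cells / all cells

round(c(mean_length = mean_length, share_2_to_5 = share_2_to_5, density = density), 4)
```

The density is therefore just the mean transaction length divided by the number of items, which is why a long item list combined with short baskets gives a very sparse matrix.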

This summary provides valuable insights into the structure and characteristics of the transaction data, laying the groundwork for further analysis and interpretation.
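
The item labels listed in the summary include 1, 10, and 11, which do not look like product names. One possible way to deal with such entries, sketched here on hypothetical toy baskets and under the assumption that purely numeric labels are not real products, is to drop those columns before mining.

```r
library(arules)

# Hypothetical baskets whose first entry is a number, mimicking labels such as "1", "10", "11"
baskets <- list(
  c("2", "whole milk", "soda"),
  c("1", "rolls/buns"),
  c("3", "whole milk", "soda", "yogurt")
)
toy <- as(baskets, "transactions")

# Flag item labels that consist of digits only (assumed here not to be products)
numeric_label <- grepl("^[0-9]+$", itemLabels(toy))

# Keep all transactions, but only the columns (items) with non-numeric labels
toy_clean <- toy[, !numeric_label]
itemLabels(toy_clean)
```

Whether this is appropriate for the grocery file depends on how the CSV is laid out, which should be checked directly before any items are removed.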

Relative item frequency

itemFrequencyPlot(transactions, topN = 10, cex.names = 1)

The next stage is rule mining with the Apriori algorithm through the apriori() function. The function requires both a minimum support and a minimum confidence constraint. Its parameter argument takes a list that sets the support threshold (supp), the confidence threshold (conf), and the maximum number of items per rule (maxlen), among other options. If no thresholds are provided, the function defaults to a support threshold of 0.1 and a confidence threshold of 0.8.

rules <- apriori(transactions, 
                 parameter = list(supp = 0.01, conf = 0.1, maxlen = 10, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[231 item(s), 9836 transaction(s)] done [0.00s].
## sorting and recoding items ... [100 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [522 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(rules)
## set of 522 rules
## 
## rule length distribution (lhs + rhs):sizes
##   1   2   3 
##  12 414  96 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.161   2.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01007   Min.   :0.1004   Min.   :0.01728   Min.   :0.4884  
##  1st Qu.:0.01149   1st Qu.:0.1398   1st Qu.:0.05246   1st Qu.:1.2832  
##  Median :0.01454   Median :0.2007   Median :0.07666   Median :1.5759  
##  Mean   :0.02088   Mean   :0.2384   Mean   :0.11130   Mean   :1.6338  
##  3rd Qu.:0.02105   3rd Qu.:0.3134   3rd Qu.:0.10899   3rd Qu.:1.8924  
##  Max.   :0.25549   Max.   :0.5989   Max.   :1.00000   Max.   :3.3726  
##      count       
##  Min.   :  99.0  
##  1st Qu.: 113.0  
##  Median : 143.0  
##  Mean   : 205.4  
##  3rd Qu.: 207.0  
##  Max.   :2513.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          9836    0.01        0.1
##                                                                                                    call
##  apriori(data = transactions, parameter = list(supp = 0.01, conf = 0.1, maxlen = 10, target = "rules"))

The analysis presents a comprehensive summary of a set of 522 association rules derived from a dataset of transactions.

Rule Length Distribution:

The length of a rule is the total number of items in its antecedent (lhs) and consequent (rhs) combined, and here it ranges from 1 to 3. Of the 522 rules, 12 have length 1 (an empty antecedent and a single consequent item), 414 have length 2 (one item on each side), and 96 have length 3.

Summary of Quality Measures:

The quality measures describe how frequent and how reliable the rules are.

- Support: Ranges from 0.01007 to 0.25549 and gives the proportion of transactions that contain all items of the rule.
- Confidence: Ranges from 0.1004 to 0.5989 and gives the conditional probability of the consequent given the antecedent.
- Coverage: Ranges from 0.01728 to 1.0 and gives the proportion of transactions that contain the antecedent. A coverage of 1.0 belongs to rules with an empty antecedent.
- Lift: Ranges from 0.4884 to 3.3726 and compares the confidence of the rule with the overall frequency of the consequent. Values above 1 indicate that the items occur together more often than expected under independence, and values below 1 indicate that they occur together less often.
- Count: Ranges from 99 to 2513 and gives the number of transactions that contain all items of the rule.
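
For reference, the standard definitions of these measures can be written out as follows for a rule X ⇒ Y over N transactions. The notation (T for the set of transactions, supp, conf, cov) is introduced here for illustration and does not come from the arules output.

```latex
\begin{align*}
\operatorname{count}(X \Rightarrow Y) &= \bigl|\{\, t \in T : X \cup Y \subseteq t \,\}\bigr| \\
\operatorname{supp}(X \Rightarrow Y)  &= \frac{\operatorname{count}(X \Rightarrow Y)}{N} \\
\operatorname{cov}(X \Rightarrow Y)   &= \operatorname{supp}(X) \\
\operatorname{conf}(X \Rightarrow Y)  &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y)  &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
\end{align*}
```

With an empty antecedent, supp(X) = 1, so the confidence equals the support of the consequent and the lift equals 1.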

Mining Info:

The rules were generated using the Apriori algorithm with the following parameters.

- Data: The transactions object, comprising 9836 transactions.
- Support Threshold: Set at 0.01, so a rule is kept only if its items occur together in at least about 1% of the transactions (the log reports an absolute minimum support count of 98).
- Confidence Threshold: Set at 0.1, the minimum confidence a rule needs in order to be kept.
- Maxlen: The maximum number of items per rule, set at 10. The longest rules actually found contain 3 items.
- Target: Set to "rules", so the function returns association rules rather than frequent itemsets.

Overall, this analysis offers valuable insights into the association patterns within the dataset, providing actionable information for business decision-making, such as optimizing product placements, cross-selling strategies, or targeted marketing campaigns based on identified associations between items.
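
To see how the thresholds act, the following minimal sketch runs apriori() on five hypothetical baskets invented for illustration. The thresholds are chosen only to suit this tiny dataset, and minlen = 2 is added so that rules with an empty antecedent are left out.

```r
library(arules)

# Hypothetical baskets, invented for illustration only
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk", "butter"),
  c("bread", "butter"),
  c("milk", "bread")
)
toy <- as(baskets, "transactions")

# Mine rules: itemsets must reach 30% support, rules must reach 60% confidence,
# and every rule must contain at least two items (no empty antecedent)
toy_rules <- apriori(
  toy,
  parameter = list(supp = 0.3, conf = 0.6, minlen = 2, target = "rules"),
  control = list(verbose = FALSE)
)

# Show the rules together with their quality measures, strongest lift first
inspect(sort(toy_rules, by = "lift"))
```

With five baskets, a support threshold of 0.3 means an itemset has to appear in at least two baskets, so the one basket containing all three items does not produce any three-item rule.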

inspect(head(rules))
##     lhs    rhs              support   confidence coverage lift count
## [1] {}  => {1}              0.2194998 0.2194998  1        1    2159 
## [2] {}  => {2}              0.1670394 0.1670394  1        1    1643 
## [3] {}  => {4}              0.1021757 0.1021757  1        1    1005 
## [4] {}  => {3}              0.1320659 0.1320659  1        1    1299 
## [5] {}  => {bottled water}  0.1105124 0.1105124  1        1    1087 
## [6] {}  => {tropical fruit} 0.1049207 0.1049207  1        1    1032

The first rules listed have an empty antecedent ({}). For such rules the coverage is 1, the confidence equals the support of the consequent, and the lift is 1, so they only restate how frequent a single item is.


Visualizing association rules

Mining association rules often produces a very large number of rules, and sifting through them manually to find the interesting ones is time consuming. In addition to computing the associations, we can therefore use the arulesViz package to visualize the results as scatterplots, interactive scatterplots, and representations of individual rules.

Scatterplot of support vs. confidence

plot(rules, method = "scatter", measure = c("support", "confidence"))
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Scatterplot of support vs. lift

plot(rules, method = "scatter", measure = c("support", "lift"))
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Scatterplot of confidence vs. lift

plot(rules, method = "scatter", measure = c("confidence", "lift"))

Graph-based visualization methods focus on illustrating relationships between individual items within a set of rules. These techniques represent rules or itemsets as graphs, where items are labeled vertices, and rules or itemsets are depicted as vertices connected to items via arrows.

In the case of rules, arrows point from the items on the left-hand side (LHS) towards the vertex representing the rule, and arrows point towards the items on the right-hand side (RHS) from the rule vertex.
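
As a small base R sketch of this representation, the rule {citrus fruit, root vegetables} => {other vegetables}, which appears in the subset inspected further down, can be written as an edge list. The vertex name "rule 2" and the data frame layout are illustrative choices, not the internal format used by arulesViz.

```r
# One rule, written out by hand
rule_id <- "rule 2"
lhs <- c("citrus fruit", "root vegetables")
rhs <- "other vegetables"

# One arrow from each LHS item into the rule vertex,
# and one arrow from the rule vertex to each RHS item
edges <- rbind(
  data.frame(from = lhs, to = rule_id, stringsAsFactors = FALSE),
  data.frame(from = rule_id, to = rhs, stringsAsFactors = FALSE)
)
edges
```

A rule with two antecedent items and one consequent item thus contributes one rule vertex and three arrows, which is why graphs become hard to read once many rules are drawn at once.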

Several engines are available for generating these visualizations. One of them is igraph, which relies on functions such as plot.igraph and tkplot, the latter for interactive use. Additional arguments, such as colors, can be passed on to these functions for customization. The code below uses the htmlwidget engine instead, which produces an interactive graph.

The network graph displayed below illustrates associations between selected items. Larger circles indicate higher support, while red circles represent higher lift. Graph-based visualizations are most effective with a small number of rules, so we limit the display to the 10 rules with the highest confidence.

subrules <- head(rules, n = 10, by = "confidence")

inspect(subrules)
##      lhs                                  rhs                support   
## [1]  {11}                              => {whole milk}       0.01108174
## [2]  {citrus fruit, root vegetables}   => {other vegetables} 0.01037007
## [3]  {root vegetables, tropical fruit} => {other vegetables} 0.01230175
## [4]  {curd, yogurt}                    => {whole milk}       0.01006507
## [5]  {butter, other vegetables}        => {whole milk}       0.01148841
## [6]  {root vegetables, tropical fruit} => {whole milk}       0.01199675
## [7]  {root vegetables, yogurt}         => {whole milk}       0.01453843
## [8]  {11}                              => {other vegetables} 0.01026840
## [9]  {domestic eggs, other vegetables} => {whole milk}       0.01230175
## [10] {whipped/sour cream, yogurt}      => {whole milk}       0.01087841
##      confidence coverage   lift     count
## [1]  0.5989011  0.01850346 2.344127 109  
## [2]  0.5862069  0.01769012 3.029916 102  
## [3]  0.5845411  0.02104514 3.021306 121  
## [4]  0.5823529  0.01728345 2.279357  99  
## [5]  0.5736041  0.02002847 2.245113 113  
## [6]  0.5700483  0.02104514 2.231196 118  
## [7]  0.5629921  0.02582351 2.203578 143  
## [8]  0.5549451  0.01850346 2.868334 101  
## [9]  0.5525114  0.02226515 2.162556 121  
## [10] 0.5245098  0.02074014 2.052956 107
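# Interactive network graph of the 10 selected rules
plot(subrules, method = "graph", engine = "htmlwidget")
# Parallel coordinates plot of the same rules
plot(subrules, method = "paracoord")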

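The ten rules with the highest confidence all have either “whole milk” or “other vegetables” as their consequent. Their confidence lies between about 0.52 and 0.60, so the consequent appears in more than half of the transactions that contain the antecedent. Their lift lies between about 2.05 and 3.03, meaning the consequent is two to three times as likely in those transactions as in the dataset overall. Support is low, at roughly 1%, so each rule rests on only 99 to 143 transactions. Two of the rules have the numeric label “11” as their antecedent and should be read with caution, since it does not look like a product name.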
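
The measures reported for a single rule can be recomputed by hand. The sketch below does this in base R for rule [1], {11} => {whole milk}, using counts taken from the outputs above; the antecedent count of 182 is not printed directly and is derived here from the reported coverage (0.01850346 × 9836).

```r
n_transactions <- 9836   # from dim(transactions)
count_rule     <- 109    # transactions containing both {11} and {whole milk}
count_lhs      <- 182    # transactions containing {11}, derived from the coverage
count_rhs      <- 2513   # transactions containing {whole milk}, from summary(transactions)

support    <- count_rule / n_transactions
coverage   <- count_lhs / n_transactions
confidence <- support / coverage                        # same as count_rule / count_lhs
lift       <- confidence / (count_rhs / n_transactions) # confidence relative to base rate

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 4)
```

The results agree with the values shown by inspect(subrules) for this rule, which confirms that lift is the confidence divided by the overall frequency of the consequent.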

Conclusion

In conclusion, the market basket analysis of this dataset describes purchasing patterns in a grocery store setting. The most frequent products are “whole milk,” “other vegetables,” and “rolls/buns.” With a minimum support of 0.01 and a minimum confidence of 0.1, the Apriori algorithm produced 522 association rules, with confidence of up to about 0.60 and lift of up to about 3.37.

The rule analysis has unveiled actionable patterns for businesses, such as the tendency of certain items to be frequently purchased together, indicating opportunities for strategic product placement or bundled promotions. Additionally, the visualization of association rules through scatterplots and network graphs has facilitated the exploration of complex relationships within the dataset, aiding in the identification of high-impact associations.

Overall, these findings can inform sales strategies and other retail decisions, bearing in mind that they come from a simplified dataset and may differ from results obtained in real-world scenarios.