INTRODUCTION

In the world of retail, understanding customer behavior is paramount, and the practice of Market Basket Analysis serves as the compass guiding retailers through this intricate landscape. In this academic project, we embark on a journey into the realm of grocery shopping patterns using the ‘Groceries’ dataset, a treasure trove of consumer purchasing data. This dataset, nestled within the ‘arules’ package, encapsulates the shopping habits of countless buyers, offering a unique window into their preferences and choices.

Market Basket Analysis is a powerful technique that helps retailers decipher the complex web of associations between products. By identifying which items are frequently purchased together, supermarkets and retailers gain invaluable insights into customer preferences. These insights, in turn, have the potential to reshape supply chain strategies, inform product recommendations, and even influence store layouts to enhance the overall shopping experience.

In the sections that follow, we will journey through data pre-processing and exploration, item frequency analysis, association rule generation, interpretation of business insights, and data visualization, unveiling fascinating patterns along the way. Our aim is to provide a comprehensive understanding of the ‘Groceries’ dataset and offer actionable insights that can empower retailers to make data-driven decisions and better cater to the ever-evolving needs of their customers. Join us as we unlock the secrets of grocery shopping behaviors and explore the transformative potential of Market Basket Analysis.

DATA PRE-PROCESSING & EXPLORATION

Our journey begins with data pre-processing. We load two R packages: ‘arules’ for mining association rules and ‘arulesViz’ for visualizing them. The heart of our analysis lies in the ‘Groceries’ dataset, which records 9,835 shopping transactions.

We kick off our exploration by addressing two fundamental questions about the dataset: what class of object is it, and how many rows and columns does it contain?

These questions set the stage for our exploration, providing a foundational understanding of the dataset.

Load necessary packages:

library(arules)
library(arulesViz) 
# Load "Groceries" dataset (found in 'arules' package)
data(Groceries)

Describe the “Groceries” dataset by answering class & dimension-related questions:

summary(Groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
class(Groceries)  # alternative code for "class"
## [1] "transactions"
## attr(,"package")
## [1] "arules"
dim(Groceries)  # alternative code for number of rows & columns
## [1] 9835  169

The ‘summary()’ function helps us describe the “Groceries” dataset: it belongs to the ‘transactions’ class (i.e., a transaction database) and is stored as a sparse matrix. The dataset has 9835 rows (i.e., itemsets/transactions) and 169 columns (i.e., individual grocery items). The ‘class()’ and ‘dim()’ functions lead to the same conclusions, returning the class of “Groceries” and its number of rows and columns, respectively.
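
As a quick consistency check, the figures in the summary output can be tied together with a little arithmetic. The sketch below uses only numbers copied from the output above; the variable names are illustrative and not part of ‘arules’.

```r
# Values copied from the summary(Groceries) output above
n_transactions <- 9835
n_items <- 169
density <- 0.02609146

# Density is the share of filled cells in the sparse matrix, so multiplying
# it back out gives the total number of item occurrences across all baskets
total_item_occurrences <- round(density * n_transactions * n_items)

# Counts from the "most frequent items" block, including "(Other)"
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)

# Average basket size implied by the totals
mean_basket_size <- total_item_occurrences / n_transactions

print(total_item_occurrences)
print(sum(item_counts))
print(round(mean_basket_size, 3))
```

Both totals come to 43367, and the implied average of roughly 4.41 items per transaction agrees with the mean reported in the length distribution.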

ITEM FREQUENCY ANALYSIS

Next, we venture into the realm of item frequency analysis. Armed with the ‘itemFrequencyPlot’ function, we chart how often each grocery item is purchased. By setting a support threshold of 0.05, we keep only the items that appear in at least 5% of all transactions. This analysis highlights the items with a significant presence in our dataset; it describes individual items rather than combinations, which are the subject of the association rules generated in the next section.

# Generate an item frequency barplot for the grocery items with support of at least 0.05
itemFrequencyPlot(Groceries, support = 0.05, cex.names = 0.6, main = 'Item Frequency Plot', ylab = "Item Frequency")

# The argument 'support = 0.05' denotes the support threshold in 'itemFrequencyPlot()'

# 'cex.names = 0.6' adjusts the size of the x-axis labels for easier interpretation
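
To make the support threshold concrete, the sketch below converts the counts of the five most frequent items (taken from the ‘summary()’ output earlier) into support values and compares them with the 0.05 cut-off. It assumes that an item's support is simply its count divided by the number of transactions; the variable names are illustrative.

```r
# Counts copied from the "most frequent items" block of summary(Groceries)
n_transactions <- 9835
item_counts <- c("whole milk" = 2513, "other vegetables" = 1903,
                 "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)

# Support of a single item = share of transactions that contain it
item_support <- item_counts / n_transactions

# Items that would be kept by a support threshold of 0.05
passes_threshold <- item_support >= 0.05

# Smallest transaction count that reaches the threshold
min_count_needed <- ceiling(0.05 * n_transactions)

print(round(item_support, 3))
print(passes_threshold)
print(min_count_needed)
```

All five items clear the threshold comfortably; an item needs to appear in at least 492 of the 9835 transactions to be plotted.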

ASSOCIATION RULE GENERATION

Our journey takes an exciting turn as we generate association rules using the Apriori algorithm. By setting parameters for support, confidence, and minimum length, we extract rules of the form: “If the customer purchases itemset ‘x,’ then itemset ‘y’ is also likely to be purchased.”

We then dissect these rules to understand the relationships between grocery items. We create subsets of rules, grouping them by the presence of the term “chicken” in either the antecedent (lhs) or consequent (rhs). These rules provide valuable insights into the co-purchase tendencies of items, shedding light on intriguing shopping patterns.
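
For reference, the quality measures used throughout this section can be written as follows. These are the standard textbook definitions rather than anything specific to this project; the symbols are notation introduced here for illustration, with N the total number of transactions, X the antecedent, Y the consequent, and σ(Z) the number of transactions containing every item in itemset Z.

```latex
\begin{align*}
\operatorname{supp}(Z) &= \frac{\sigma(Z)}{N} \\
\operatorname{supp}(X \Rightarrow Y) &= \operatorname{supp}(X \cup Y) \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
\end{align*}
```

A lift of 1 means the antecedent and consequent occur together about as often as would be expected if they were independent; values above 1 indicate a positive association.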

# Generate association rules with 'apriori()'; rules containing "chicken" are subset in a later step

basket <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.25, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [17391 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

Note: this allows us to generate rules of the form: “If the customer purchases itemset ‘x,’ itemset ‘y’ is also purchased.”
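
The support parameter is a proportion, so it helps to translate it into a number of transactions. The sketch below is a small worked calculation under the assumption that a rule qualifies when its count divided by the number of transactions is at least the minimum support; the variable names are illustrative.

```r
n_transactions <- 9835
min_support <- 0.001

# Smallest whole number of transactions whose share reaches min_support
min_rule_count <- ceiling(min_support * n_transactions)

# Support of a rule that occurs in exactly that many transactions
min_rule_support <- min_rule_count / n_transactions

print(min_rule_count)
print(round(min_rule_support, 6))
```

This agrees with the ‘summary()’ output of the generated rules, where the smallest count is 10 and the smallest support is 0.001017.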

# View output of the 'apriori()' function (i.e., association rules generated from "Groceries")
summary(basket)  
## set of 17391 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6 
##  367 6906 8371 1687   60 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.665   4.000   6.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.001017   Min.   :0.2500   Min.   :0.001017   Min.   : 0.9784  
##  1st Qu.:0.001118   1st Qu.:0.3125   1st Qu.:0.002542   1st Qu.: 2.1557  
##  Median :0.001322   Median :0.4016   Median :0.003559   Median : 2.7955  
##  Mean   :0.001914   Mean   :0.4394   Mean   :0.004972   Mean   : 3.0837  
##  3rd Qu.:0.001932   3rd Qu.:0.5397   3rd Qu.:0.005186   3rd Qu.: 3.6654  
##  Max.   :0.074835   Max.   :1.0000   Max.   :0.255516   Max.   :35.7158  
##      count       
##  Min.   : 10.00  
##  1st Qu.: 11.00  
##  Median : 13.00  
##  Mean   : 18.83  
##  3rd Qu.: 19.00  
##  Max.   :736.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001       0.25
##                                                                                         call
##  apriori(data = Groceries, parameter = list(support = 0.001, confidence = 0.25, minlen = 2))
# Subset rules with "chicken" in rhs
r.rules <- subset(basket, subset = rhs %pin% "chicken")

# Subset rules with "chicken" in lhs
l.rules <- subset(basket, subset = lhs %pin% "chicken")

# '%pin%' applies partial matching on item labels, so each subset keeps the rules from "Groceries" whose lhs or rhs contains an item matching "chicken"
# Combine the rules generated above
chicken <- union(r.rules, l.rules)  
# Describes the subset of rules containing "chicken"
summary(chicken)  
## set of 532 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5 
##   3 226 276  27 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.615   4.000   5.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.001017   Min.   :0.2500   Min.   :0.001220   Min.   :1.198  
##  1st Qu.:0.001118   1st Qu.:0.3191   1st Qu.:0.002440   1st Qu.:2.263  
##  Median :0.001322   Median :0.4167   Median :0.003355   Median :2.944  
##  Mean   :0.001672   Mean   :0.4487   Mean   :0.004136   Mean   :3.186  
##  3rd Qu.:0.001729   3rd Qu.:0.5611   3rd Qu.:0.004779   3rd Qu.:3.771  
##  Max.   :0.017895   Max.   :0.8571   Max.   :0.042908   Max.   :9.711  
##      count       
##  Min.   : 10.00  
##  1st Qu.: 11.00  
##  Median : 13.00  
##  Mean   : 16.45  
##  3rd Qu.: 17.00  
##  Max.   :176.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001       0.25
##                                                                                         call
##  apriori(data = Groceries, parameter = list(support = 0.001, confidence = 0.25, minlen = 2))
# Inspect 10 rules in 'chicken' subset with the greatest lift values
inspect(sort(chicken, by = "lift")[1:10])
##      lhs                           rhs                     support confidence    coverage     lift count
## [1]  {citrus fruit,                                                                                     
##       root vegetables,                                                                                  
##       domestic eggs}            => {chicken}           0.001016777  0.4166667 0.002440264 9.710703    10
## [2]  {other vegetables,                                                                                 
##       whole milk,                                                                                       
##       domestic eggs,                                                                                    
##       rolls/buns}               => {chicken}           0.001016777  0.3703704 0.002745297 8.631736    10
## [3]  {citrus fruit,                                                                                     
##       yogurt,                                                                                           
##       domestic eggs}            => {chicken}           0.001016777  0.3448276 0.002948653 8.036444    10
## [4]  {sausage,                                                                                          
##       citrus fruit,                                                                                     
##       root vegetables}          => {chicken}           0.001016777  0.3448276 0.002948653 8.036444    10
## [5]  {sausage,                                                                                          
##       chicken,                                                                                          
##       citrus fruit}             => {root vegetables}   0.001016777  0.7692308 0.001321810 7.057262    10
## [6]  {other vegetables,                                                                                 
##       whole milk,                                                                                       
##       whipped/sour cream,                                                                               
##       rolls/buns}               => {chicken}           0.001118454  0.2972973 0.003762074 6.928718    11
## [7]  {chicken,                                                                                          
##       chocolate}                => {butter}            0.001016777  0.3703704 0.002745297 6.683656    10
## [8]  {sausage,                                                                                          
##       chicken,                                                                                          
##       whole milk}               => {butter}            0.001118454  0.3666667 0.003050330 6.616820    11
## [9]  {chicken,                                                                                          
##       root vegetables,                                                                                  
##       other vegetables,                                                                                 
##       whole milk}               => {domestic eggs}     0.001220132  0.4137931 0.002948653 6.521883    12
## [10] {chicken,                                                                                          
##       long life bakery product} => {frozen vegetables} 0.001016777  0.3125000 0.003253686 6.497754    10
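
The ‘%pin%’ operator used to build the two subsets relies on partial matching of item labels, whereas the ‘%in%’ operator in ‘arules’ matches labels exactly. The difference can be illustrated with a base R analogy. This is only a sketch: the item labels are hypothetical, and ‘grepl()’ stands in for partial matching rather than reproducing how ‘arules’ implements it.

```r
# Hypothetical item labels, for illustration only (not taken from Groceries)
item_labels <- c("chicken", "frozen chicken", "whole milk")

# Exact matching: only the label that equals "chicken" is selected
exact_match <- item_labels %in% "chicken"

# Partial matching: any label that contains "chicken" is selected
partial_match <- grepl("chicken", item_labels, fixed = TRUE)

print(data.frame(item_labels, exact_match, partial_match))
```

With partial matching, a search term can pull in related items whose labels merely contain it, which is convenient for exploration but worth checking when item names overlap.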

INTERPRETING BUSINESS INSIGHTS

In the business world, data-driven decisions are invaluable. We take four selected association rules and decode their implications. These rules reveal the conditional probabilities of purchasing specific items when others are in the shopping cart. Lift values quantify the strength of these associations, highlighting the significance of each rule.

But what do these rules mean for a supermarket retailer, such as Star Market? They hold the key to optimizing inventory, enhancing customer recommendations, and optimizing store layouts. Retailers can use this information to predict buying behavior, ensuring they stock items that are frequently purchased together, offer tailored recommendations, and improve the overall shopping experience.

# Create and inspect a new object holding 4 of the rules sorted by lift: rows 1, 2, 5, and 7 of the previous output (2 rules with "chicken" in the rhs and 2 with "chicken" in the lhs)

chicken.rules <- sort(chicken, by = "lift")[c(1:2, 5, 7)]

inspect(chicken.rules)
##     lhs                    rhs                   support confidence    coverage     lift count
## [1] {citrus fruit,                                                                            
##      root vegetables,                                                                         
##      domestic eggs}     => {chicken}         0.001016777  0.4166667 0.002440264 9.710703    10
## [2] {other vegetables,                                                                        
##      whole milk,                                                                              
##      domestic eggs,                                                                           
##      rolls/buns}        => {chicken}         0.001016777  0.3703704 0.002745297 8.631736    10
## [3] {sausage,                                                                                 
##      chicken,                                                                                 
##      citrus fruit}      => {root vegetables} 0.001016777  0.7692308 0.001321810 7.057262    10
## [4] {chicken,                                                                                 
##      chocolate}         => {butter}          0.001016777  0.3703704 0.002745297 6.683656    10

For each of these 4 association rules, the antecedent is denoted by the left-hand side (‘lhs’) of the rule and the consequent is denoted by the right-hand side (‘rhs’). In general, these rules describe the associative relationship between the antecedent and the consequent itemsets, and can be read in the form: “customers who purchased itemset ‘a’ also purchased itemset ‘b’.” In terms of support, or how frequently a given itemset appears (a value that is identical for all 4 rules), the rules can be interpreted as follows: about 0.10% of all transactions under analysis (10 of 9835) contain both the antecedent and consequent itemsets. In terms of confidence, or the conditional probability that a transaction containing the antecedent also contains the consequent, the rules can be interpreted as follows:

  1. Given all the itemsets that contain citrus fruit, root vegetables, and domestic eggs, 41.67% of those transactions/itemsets also contained chicken.

  2. Given all the itemsets that contain other vegetables, whole milk, domestic eggs, and rolls/buns, 37.04% of those transactions/itemsets also contained chicken.

  3. Given all the itemsets that contain sausage, chicken, & citrus fruit, 76.92% of those transactions/itemsets also contained root vegetables.

  4. Given all the itemsets that contain chicken and chocolate, 37.04% of those transactions/itemsets also contained butter.
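
The confidence values can be recovered from the other columns of the ‘inspect()’ output, since confidence equals a rule's support divided by its coverage (the support of the antecedent). The following is a minimal sketch using the printed values; the variable names are illustrative.

```r
# Values copied from the inspect(chicken.rules) output
n_transactions <- 9835
rule_support <- c(0.001016777, 0.001016777, 0.001016777, 0.001016777)
rule_coverage <- c(0.002440264, 0.002745297, 0.001321810, 0.002745297)

# Confidence = support of the whole rule / support of the antecedent
rule_confidence <- rule_support / rule_coverage

# Number of transactions that contain each antecedent
antecedent_count <- round(rule_coverage * n_transactions)

print(round(100 * rule_confidence, 2))
print(antecedent_count)
```

The results round to 41.67%, 37.04%, 76.92%, and 37.04%. In terms of counts, each rule holds in 10 transactions out of the 24, 27, 13, and 27 transactions that contain its antecedent.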

Finally, based on the lift values corresponding to each of the rules (i.e., a quality measure describing the association between the antecedent and consequent itemsets - accounting for the “chance”/“randomness” that might play a role in this rule), we can describe the association rules in terms of their lift as follows:

  1. Given that a person purchases citrus fruit, root vegetables, and domestic eggs, that person is 9.71 times more likely to also buy chicken than is the ‘random’ shopper.

  2. Given that a person purchases other vegetables, whole milk, domestic eggs, and rolls/buns, that person is 8.63 times more likely to also buy chicken than is the ‘random’ shopper.

  3. Given that a person purchases sausage, chicken, & citrus fruit, that person is 7.06 times more likely to also buy root vegetables than is the ‘random’ shopper.

  4. Given that a person purchases chicken and chocolate, that person is 6.68 times more likely to also buy butter than is the ‘random’ shopper.
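
Lift compares a rule's confidence with the overall support of its consequent, so dividing confidence by lift recovers how common the consequent is across all transactions. The sketch below does this for the 4 rules using the printed values. The variable names are illustrative, and the implied supports and counts are derived figures rather than values shown in the output.

```r
# Values copied from the inspect(chicken.rules) output
n_transactions <- 9835
consequent <- c("chicken", "chicken", "root vegetables", "butter")
rule_confidence <- c(0.4166667, 0.3703704, 0.7692308, 0.3703704)
rule_lift <- c(9.710703, 8.631736, 7.057262, 6.683656)

# lift = confidence / support(consequent), rearranged for the consequent
consequent_support <- rule_confidence / rule_lift
consequent_count <- round(consequent_support * n_transactions)

print(data.frame(consequent,
                 consequent_support = round(consequent_support, 4),
                 consequent_count))
```

Chicken appears in about 4.3% of all transactions, so a 41.67% chance of buying it given the antecedent of rule 1 is roughly 9.7 times the baseline. Root vegetables are more common overall (about 10.9%), which is why rule 3 has the highest confidence of the four but not the highest lift.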

What meaning might these rules have for a supermarket retailer, such as Star Market? What could it do with this information?

A supermarket retailer can generate association rules as part of a market basket analysis, in which the customer transaction database is analyzed to determine the associations between grocery items. In other words, a supermarket such as Star Market would be able to determine “what goes with what” when customers are shopping, answering questions like: which items are frequently purchased together; what is the likelihood of those itemsets being purchased together (i.e., out of all transactions); and, given that one itemset is purchased (e.g., the antecedent itemset {citrus fruit, root vegetables, domestic eggs} for rule 1), what is the likelihood that the consequent itemset is also purchased (e.g., {chicken} for rule 1)?

These transaction-driven insights might support future decisions regarding the supermarket's supply chain and inventory. For example, if demand for a certain item/itemset is expected to increase and that itemset is often purchased with a second item/itemset, the supermarket can use this information to ensure proper inventory levels for both.

This information might also help to inform recommendations to customers in an online setting (i.e., informing a recommender system for the supermarket) or decisions about the store layout, as associated items/itemsets might be placed in the same aisle or section of the store to make them more accessible and enhance the customer shopping experience.

DATA VISUALIZATION

To conclude our journey, we employ data visualization techniques to make these insights come to life. A scatter plot and an interactive graph-based visualization let us explore and understand the relationships between items and rules. These visuals facilitate a deeper understanding of the data, making it accessible to a wider audience.

# Scatter plot to visualize the first 3 of the 4 selected rules
plot(chicken.rules[1:3])

The output is a scatter plot with support on the x-axis and confidence on the y-axis; the lift of each rule is communicated through color shading (i.e., the darker the color, the greater the lift value for that rule, and vice versa). The distribution of confidence relative to support is interesting for these 3 rules: their support is identical, but their confidence is dispersed (around 0.370, 0.417, and 0.769). Likewise, among these 3 rules the scatter plot shows an inverse relationship between lift and confidence, with higher lift values (8.63 and 9.71) corresponding to lower confidence (0.370 and 0.417) for the 2 rules closest to the x-axis, and lower lift (7.06) corresponding to higher confidence (0.769) for the rule at the top. This is possible because lift divides confidence by the support of the consequent, so a rule whose consequent is purchased more often overall can have high confidence but a more modest lift. Based on the output of the ‘inspect()’ function applied in a previous step, we know that the rule with lower lift and high confidence is the one in which the itemset containing “chicken” is the antecedent.

# Generate a graph-based visualization of 3 rules with interactive capabilities
plot(chicken.rules[1:3], method = "graph", engine = "htmlwidget")

As a result of applying the argument ‘method = “graph”’ to the ‘plot()’ function, the 3 association rules containing “chicken” are visualized in a graph format (i.e., like a directed network diagram, using nodes and arrows). This allows us to view the associations between given itemsets in a manner that is intuitive and easier to understand. Each rule is a node and the itemsets are given in rectangular boxes. The second argument, ‘engine = “htmlwidget”’, allows for interactive capabilities, unlike the static format of the scatter plot in the previous step. The interactive functions of this plot enable seamless retrieval of information about each rule (e.g., support, confidence, lift) when a user hovers their mouse over the points, and allow for manual or interactive filtering by certain rules or individual items. Such a format might be useful for deeper exploration and comparison of the given rules about the association between items/itemsets (i.e., those with “chicken”).