In the world of retail, understanding customer behavior is paramount, and the practice of Market Basket Analysis serves as the compass guiding retailers through this intricate landscape. In this academic project, we embark on a journey into the realm of grocery shopping patterns using the ‘Groceries’ dataset, a treasure trove of consumer purchasing data. This dataset, nestled within the ‘arules’ package, encapsulates the shopping habits of countless buyers, offering a unique window into their preferences and choices.
Market Basket Analysis is a powerful technique that helps retailers decipher the complex web of associations between products. By identifying which items are frequently purchased together, supermarkets and retailers gain invaluable insights into customer preferences. These insights, in turn, have the potential to reshape supply chain strategies, inform product recommendations, and even influence store layouts to enhance the overall shopping experience.
In the steps that follow, we will journey through the data pre-processing, exploration, item frequency analysis, and association rule generation steps, unveiling fascinating patterns along the way. Our aim is to provide a comprehensive understanding of the ‘Groceries’ dataset and offer actionable insights that can empower retailers to make data-driven decisions and better cater to the ever-evolving needs of their customers. Join us as we unlock the secrets of grocery shopping behaviors and explore the transformative potential of Market Basket Analysis.
Our journey begins with data pre-processing. We load essential R packages, such as ‘arules’ and ‘arulesViz,’ to harness the power of association rule mining. The heart of our analysis lies in the ‘Groceries’ dataset, which encapsulates the purchases of countless shoppers.
We kick off our exploration by addressing fundamental questions about the dataset:
What class does “Groceries” belong to?
How many rows and columns does “Groceries” contain?
These questions set the stage for our exploration, providing a foundational understanding of the dataset.
Load necessary packages:
library(arules)
library(arulesViz)
# Load "Grocieries" dataset (found in 'arules' package)
data(Groceries)
Describe the “Groceries” dataset by answering class & dimension-related questions:
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels level2 level1
## 1 frankfurter sausage meat and sausage
## 2 sausage sausage meat and sausage
## 3 liver loaf sausage meat and sausage
class(Groceries) # alternative way to confirm the class of "Groceries"
## [1] "transactions"
## attr(,"package")
## [1] "arules"
dim(Groceries) # alternative way to obtain the number of rows & columns
## [1] 9835 169
The ‘summary()’ function helps us describe the “Groceries” dataset: “Groceries” belongs to the ‘transactions’ class (i.e., a transaction database) and is stored as a sparse item matrix. Furthermore, the “Groceries” dataset has 9835 rows (i.e., itemsets/transactions) and 169 columns (i.e., individual grocery items). The same conclusions are reached when applying the ‘class()’ function (which returns the class of “Groceries”) and the ‘dim()’ function (which returns the number of rows and columns, respectively) to the data.
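To make this sparse representation more concrete, we can also peek at a few raw transactions directly; a minimal illustrative check using standard ‘arules’ accessors:
# Inspect the first three transactions (i.e., the raw market baskets)
inspect(Groceries[1:3])
# Preview a few of the 169 item labels (the "columns" of the sparse matrix)
head(itemLabels(Groceries))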
Next, we venture into the realm of item frequency analysis. Armed with the ‘itemFrequencyPlot’ function, we uncover patterns among grocery items. By setting a support threshold, we identify the items that appear most frequently in shoppers’ baskets. This analysis illuminates the items with a significant presence in our dataset, giving us a glimpse into the most popular purchases.
# Generate an item frequency barplot for the grocery items with a support of at least 0.05
itemFrequencyPlot(Groceries, support = 0.05, cex.names = 0.6, main = 'Item Frequency Plot', ylab = "Item Frequency")
# The argument 'support = 0.05' denotes the support threshold in 'itemFrequencyPlot()'
# 'cex.names = 0.6' adjusts the size of the x-axis labels for easier interpretation
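As a numeric complement to the plot, the relative support of each item can be extracted directly; a minimal sketch using ‘itemFrequency()’ from ‘arules’:
# Compute the relative frequency (support) of every item and list the top 5
head(sort(itemFrequency(Groceries), decreasing = TRUE), 5)
# Whole milk, other vegetables, rolls/buns, soda, and yogurt should top the list,
# consistent with the summary() output above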
Our journey takes an exciting turn as we generate association rules using the Apriori algorithm. By setting parameters for support, confidence, and minimum rule length, we extract rules of the form: “if the customer purchases itemset ‘x’, itemset ‘y’ is also purchased.”
We then dissect these rules to understand the relationships between grocery items. We create subsets of rules, grouping them by the presence of the term “chicken” in either the antecedent (lhs) or consequent (rhs). These rules provide valuable insights into the co-purchase tendencies of items, shedding light on intriguing shopping patterns.
# Generate association rules from "Groceries" using the Apriori algorithm (the "chicken" subsets are created further below)
basket <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.25, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [17391 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
Note: this allows us to generate rules saying: “If the customer purchases itemset ‘x’, itemset ‘y’ is also purchased.”
# View output of the 'apriori()' function (i.e., association rules generated from "Groceries")
summary(basket)
## set of 17391 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 367 6906 8371 1687 60
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.665 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.2500 Min. :0.001017 Min. : 0.9784
## 1st Qu.:0.001118 1st Qu.:0.3125 1st Qu.:0.002542 1st Qu.: 2.1557
## Median :0.001322 Median :0.4016 Median :0.003559 Median : 2.7955
## Mean :0.001914 Mean :0.4394 Mean :0.004972 Mean : 3.0837
## 3rd Qu.:0.001932 3rd Qu.:0.5397 3rd Qu.:0.005186 3rd Qu.: 3.6654
## Max. :0.074835 Max. :1.0000 Max. :0.255516 Max. :35.7158
## count
## Min. : 10.00
## 1st Qu.: 11.00
## Median : 13.00
## Mean : 18.83
## 3rd Qu.: 19.00
## Max. :736.00
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.001 0.25
## call
## apriori(data = Groceries, parameter = list(support = 0.001, confidence = 0.25, minlen = 2))
# Subset rules with "chicken" in rhs
r.rules <- subset(basket, subset = rhs %pin% "chicken")
# Subset rules with "chicken" in lhs
l.rules <- subset(basket, subset = lhs %pin% "chicken")
# '%pin%' applies the method of "partial matching", allowing us to select the rules (mined from the "Groceries" transaction data) whose lhs or rhs contains "chicken".
# Combine the rules generated above
chicken <- union(r.rules, l.rules)
# Describes the subset of rules containing "chicken"
summary(chicken)
## set of 532 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 3 226 276 27
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.615 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001017 Min. :0.2500 Min. :0.001220 Min. :1.198
## 1st Qu.:0.001118 1st Qu.:0.3191 1st Qu.:0.002440 1st Qu.:2.263
## Median :0.001322 Median :0.4167 Median :0.003355 Median :2.944
## Mean :0.001672 Mean :0.4487 Mean :0.004136 Mean :3.186
## 3rd Qu.:0.001729 3rd Qu.:0.5611 3rd Qu.:0.004779 3rd Qu.:3.771
## Max. :0.017895 Max. :0.8571 Max. :0.042908 Max. :9.711
## count
## Min. : 10.00
## 1st Qu.: 11.00
## Median : 13.00
## Mean : 16.45
## 3rd Qu.: 17.00
## Max. :176.00
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.001 0.25
## call
## apriori(data = Groceries, parameter = list(support = 0.001, confidence = 0.25, minlen = 2))
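As a side note, the same subset can be produced in a single call by combining both conditions with a logical OR; a minimal alternative to the ‘union()’ approach above (‘chicken.alt’ is an illustrative name):
# Equivalent one-step subset: keep any rule with "chicken" on either side
chicken.alt <- subset(basket, subset = lhs %pin% "chicken" | rhs %pin% "chicken")
length(chicken.alt) # should match the 532 rules obtained via union() above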
# Inspect 10 rules in 'chicken' subset with the greatest lift values
inspect(sort(chicken, by = "lift")[1:10])
## lhs rhs support confidence coverage lift count
## [1] {citrus fruit,
## root vegetables,
## domestic eggs} => {chicken} 0.001016777 0.4166667 0.002440264 9.710703 10
## [2] {other vegetables,
## whole milk,
## domestic eggs,
## rolls/buns} => {chicken} 0.001016777 0.3703704 0.002745297 8.631736 10
## [3] {citrus fruit,
## yogurt,
## domestic eggs} => {chicken} 0.001016777 0.3448276 0.002948653 8.036444 10
## [4] {sausage,
## citrus fruit,
## root vegetables} => {chicken} 0.001016777 0.3448276 0.002948653 8.036444 10
## [5] {sausage,
## chicken,
## citrus fruit} => {root vegetables} 0.001016777 0.7692308 0.001321810 7.057262 10
## [6] {other vegetables,
## whole milk,
## whipped/sour cream,
## rolls/buns} => {chicken} 0.001118454 0.2972973 0.003762074 6.928718 11
## [7] {chicken,
## chocolate} => {butter} 0.001016777 0.3703704 0.002745297 6.683656 10
## [8] {sausage,
## chicken,
## whole milk} => {butter} 0.001118454 0.3666667 0.003050330 6.616820 11
## [9] {chicken,
## root vegetables,
## other vegetables,
## whole milk} => {domestic eggs} 0.001220132 0.4137931 0.002948653 6.521883 12
## [10] {chicken,
## long life bakery product} => {frozen vegetables} 0.001016777 0.3125000 0.003253686 6.497754 10
In the business world, data-driven decisions are invaluable. We take four selected association rules and decode their implications. These rules reveal the conditional probabilities of purchasing specific items when others are in the shopping cart. Lift values quantify the strength of these associations, highlighting the significance of each rule.
But what do these rules mean for a supermarket retailer, such as Star Market? They hold the key to optimizing inventory, enhancing customer recommendations, and improving store layouts. Retailers can use this information to anticipate buying behavior, ensuring they stock items that are frequently purchased together, offer tailored recommendations, and improve the overall shopping experience.
# Create/inspect a new object with 4 rules sorted by lift: the row indices of 2 rules with "chicken" in the rhs and 2 with "chicken" in the lhs (taken from the previous code chunk/output)
chicken.rules <- sort(chicken, by = "lift")[c(1:2, 5, 7)]
inspect(chicken.rules)
## lhs rhs support confidence coverage lift count
## [1] {citrus fruit,
## root vegetables,
## domestic eggs} => {chicken} 0.001016777 0.4166667 0.002440264 9.710703 10
## [2] {other vegetables,
## whole milk,
## domestic eggs,
## rolls/buns} => {chicken} 0.001016777 0.3703704 0.002745297 8.631736 10
## [3] {sausage,
## chicken,
## citrus fruit} => {root vegetables} 0.001016777 0.7692308 0.001321810 7.057262 10
## [4] {chicken,
## chocolate} => {butter} 0.001016777 0.3703704 0.002745297 6.683656 10
For each of these 4 association rules, the antecedent is denoted by the left-hand side (‘lhs’) of the rule and the consequent by the right-hand side (‘rhs’). In general, these rules describe the associative relationship between the antecedent and consequent itemsets and can be read in the form: “customers who purchased itemset ‘a’ also purchased itemset ‘b’.” In terms of support, or how frequently a given itemset appears (which happens to be identical for all 4 rules), the rules can be interpreted as follows: roughly 0.1% (≈ 0.0010168) of all transactions under analysis contain both the antecedent and consequent itemsets. Looking at the confidence of each rule, or the conditional probability that the consequent is purchased given the antecedent (a quick arithmetic check appears after the list below), the rules can be interpreted in the following format:
Given all the itemsets that contain citrus fruit, root vegetables, and domestic eggs, 41.67% of those transactions/itemsets also contained chicken.
Given all the itemsets that contain other vegetables, whole milk, domestic eggs, and rolls/buns, 37.04% of those transactions/itemsets also contained chicken.
Given all the itemsets that contain sausage, chicken, & citrus fruit, 76.92% of those transactions/itemsets also contained root vegetables.
Given all the itemsets that contain chicken and chocolate, 37.04% of those transactions/itemsets also contained butter.
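As promised above, the confidence of rule 1 can be reproduced from the support and coverage values reported by ‘inspect()’; pure arithmetic, no new data:
# confidence = support(lhs and rhs together) / support(lhs alone, i.e., coverage)
rule1.support <- 0.001016777 # support of the full rule (antecedent + consequent)
rule1.coverage <- 0.002440264 # support of the antecedent {citrus fruit, root vegetables, domestic eggs}
rule1.support / rule1.coverage # ~0.4167, matching the reported confidence of 41.67%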
Finally, based on the lift values corresponding to each of the rules (i.e., a quality measure describing how much more often the antecedent and consequent itemsets co-occur than would be expected by chance alone; a short check follows the list below), we can describe the association rules in terms of their lift as follows:
Given that a person purchases citrus fruit, root vegetables, and domestic eggs, that person is about 9.71 times as likely to also buy chicken as the average shopper.
Given that a person purchases other vegetables, whole milk, domestic eggs, and rolls/buns, that person is about 8.63 times as likely to also buy chicken as the average shopper.
Given that a person purchases sausage, chicken, and citrus fruit, that person is about 7.06 times as likely to also buy root vegetables as the average shopper.
Given that a person purchases chicken and chocolate, that person is about 6.68 times as likely to also buy butter as the average shopper.
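Similarly, the lift of rule 1 can be reproduced as the rule’s confidence divided by the baseline support of the consequent; a minimal check (‘chicken.support’ is an illustrative name):
# lift = confidence of the rule / support of the consequent {chicken}
rule1.confidence <- 0.4166667
chicken.support <- itemFrequency(Groceries)["chicken"] # ~0.043
rule1.confidence / chicken.support # ~9.71, matching the reported lift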
What meaning might these rules have for a supermarket retailer, such as Star Market? What could it do with this information?
A supermarket retailer can generate association rules through a market basket analysis, in which the customer transaction database is analyzed to determine the associations between grocery items. In other words, a supermarket such as Star Market would be able to determine “what goes with what” when customers are shopping, answering questions like:
Which items are frequently purchased together?
How likely are those itemsets to be purchased together across all transactions (support)?
Given that one itemset is purchased (e.g., the antecedent {citrus fruit, root vegetables, domestic eggs} in rule 1), how likely is the consequent itemset (e.g., {chicken} in rule 1) to be purchased as well (confidence)?
These transaction-driven insights can support future decisions about the supply chain and inventory: if demand for an item is expected to increase and that item is often purchased with a second itemset, the supermarket can use this information to ensure adequate inventory levels for both. The same information can inform product recommendations in an online setting (i.e., feeding a recommender system for the supermarket) and decisions about store layout, since associated items can be placed in the same aisle or section to make them more accessible and enhance the customer shopping experience. A minimal sketch of such a rule lookup for a shopper’s current basket follows.
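To make the recommendation idea concrete, the sketch below retrieves the rules whose antecedent is fully contained in a shopper’s current basket and ranks them by lift. The basket contents and the object names (‘cart.items’, ‘cart’, ‘applicable’) are illustrative assumptions, not part of the original analysis:
# Hypothetical basket for illustration
cart.items <- c("citrus fruit", "root vegetables", "domestic eggs")
# Encode the basket using the item labels of the "Groceries" data
cart <- encode(list(cart.items), itemLabels = itemLabels(Groceries))
# Keep rules whose lhs is a subset of the basket, then rank them by lift
applicable <- basket[is.subset(lhs(basket), cart, sparse = FALSE)[, 1]]
inspect(head(sort(applicable, by = "lift"), 3)) # {chicken} is likely to rank among the top consequents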
To conclude our journey, we employ data visualization techniques to make these insights come to life. Scatter plots and graph-based visualizations provide interactive tools to explore and understand the relationships between items and rules. These visuals facilitate a deeper understanding of the data, making it accessible to a wider audience.
# Scatter plot to visualize 3 rules with the greatest lift
plot(chicken.rules[1:3])
The output is a scatter plot with confidence on the y-axis, support on the x-axis, and the lift associated with each rule communicated through color shading (the darker the color, the greater the lift value for that rule, and vice versa). The distribution of confidence relative to support is interesting for these 3 rules: their support is essentially equal while their confidence is dispersed (around 0.37, 0.417, and 0.769, respectively). Likewise, the scatter plot shows an inverse relationship between lift and confidence for these 3 rules, with the higher lift values (8.6 and 9.7) corresponding to the lower-confidence rules (0.37 and 0.417) closest to the x-axis, and the lower lift (7.06) corresponding to the high-confidence rule (0.769) at the top. Based on the output of the ‘inspect()’ function applied in a previous step, we know that the rule with lower lift and high confidence is the one in which the itemset containing “chicken” is the antecedent.
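If lift is of primary interest, the same scatter plot can be redrawn with lift on an axis and confidence as the shading variable; a minimal variant using the ‘measure’ and ‘shading’ arguments of the ‘arulesViz’ scatter plot:
# Variant scatter plot: support vs. lift, shaded by confidence
plot(chicken.rules[1:3], measure = c("support", "lift"), shading = "confidence")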
# Generate a graph-based visualization of 3 rules with interactive capabilities
plot(chicken.rules[1:3], method = "graph", engine = "htmlwidget")
As a result of applying the argument ‘method = “graph”’ to the ‘plot()’ function, the 3 association rules containing “chicken” are visualized in a graph format (i.e., a directed network diagram using nodes and arrows). This allows us to view the associations between itemsets in a manner that is intuitive and easier to understand: each rule is represented by a node, with arrows connecting it to its antecedent and consequent items. The second argument, ‘engine = “htmlwidget”’, enables interactive capabilities, unlike the static scatter plot of the previous step. The interactive features of this plot allow seamless retrieval of information about each rule when a user hovers over its node (e.g., support, confidence, lift) and permit manual or interactive filtering by particular rules or individual items. Such a format is useful for deeper exploration and comparison of the rules describing the associations between items/itemsets (i.e., those with “chicken”).
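Finally, for readers who prefer a static table to an interactive widget, the highlighted rules can be coerced to a plain data frame for reporting or export; a minimal sketch (the CSV file name is illustrative):
# Coerce the four highlighted rules to a data frame
chicken.df <- as(chicken.rules, "data.frame")
head(chicken.df)
# write.csv(chicken.df, "chicken_rules.csv", row.names = FALSE) # optional export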