Association rules are a method for describing relationships between variables in a dataset. In other words, an association rule expresses the tendency of a particular operation (transaction) to occur given the occurrence of a previous one.
As a market basket analysis technique, association rules are used to measure regularities in the purchase of different sets of items.
The information gathered from association rules can be used as the basis for decisions about marketing activities such as discounts (promotional pricing) or product placement.
In this paper I am going to use association rules to determine how items should be placed next to each other on the display shelves of a particular supermarket, based on data from customers' previous transactions.
The following library packages will be used in the process of analyzing the dataset.
library("arules")
library("arulesViz")
library("plotly")
The dataset used in this article, as mentioned earlier, describes 9835 transactions made by customers of a grocery shop. The dataset was downloaded from Kaggle and can be accessed from this link.
shop <- read.csv2("groceries - groceries.csv", sep = ",")
nrow(shop)
## [1] 9835
ncol(shop)
## [1] 32
nrow(shop) gives the total number of transactions (baskets) and ncol(shop) shows that the file has 32 columns, corresponding to the largest number of items a single customer purchased in one basket. Now let us read the data as transactions using the arules package.
trans<-read.transactions("groceries - groceries.csv", format = "basket", sep=",", header = TRUE)
trans
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
From the results shown above, we can tell that there are 169 distinct items available in the supermarket. In other words, the 9835 customer transactions were drawn from a catalogue of 169 items, which already hints at relationships between the goods that customers picked together.
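If we want to see the actual item labels behind those 169 columns, they can be listed directly from the transaction object; a quick sketch using the arules itemLabels() accessor:
# Peek at a few of the 169 item labels known to the transaction object
head(itemLabels(trans), 10)
length(itemLabels(trans))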
Further vital information can be obtained with the summary() function.
summary(trans)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
The most frequent items section of the summary shows the items purchased most often by customers. In this dataset, the most purchased item turns out to be whole milk, followed by other vegetables, and so on.
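These counts can also be pulled out numerically; the snippet below is a small check using itemFrequency(), the counterpart of the tail() call used further down for the rarely bought items:
# Five most frequently purchased items, as absolute counts
# (should list whole milk, other vegetables, rolls/buns, soda and yogurt)
head(sort(itemFrequency(trans, type = "absolute"), decreasing = TRUE), n = 5)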
The element (itemset/transaction) length distribution shows how often baskets of each size occur. For example, 2159 baskets contained just one item, whilst only one basket contained 32 items.
The mean number of items in a basket is about 4.4.
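The same figures can be recomputed directly from the transaction object with size(); a minimal check of the distribution and the mean:
# Number of items in each basket
basket_sizes <- size(trans)
table(basket_sizes)  # matches the element length distribution shown above
mean(basket_sizes)   # about 4.4 items per basket on average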
Now let us visualize this information with a bar graph showing the ten most purchased items in the supermarket.
itemFrequencyPlot(trans, topN=10, type="absolute", main="Items Frequency")
To get a fair idea of the less frequent items, let us sort the item frequencies and look at the tail end.
tail(sort(itemFrequency(trans, type="absolute"), decreasing=TRUE), n=10)
## salad dressing whisky toilet cleaner
## 8 8 7
## baby cosmetics frozen chicken bags
## 6 6 4
## kitchen utensil preservation products baby food
## 4 2 1
## sound storage medium
## 1
Association rule mining relies on constraints to select the best rules from the set of all possible rules; the main ones are the support and confidence constraints. In this paper I set the support threshold to 0.01, which over the 9835 transactions corresponds to an itemset appearing in roughly 98 baskets (0.01 × 9835 ≈ 98), and the confidence threshold to 0.50, i.e. 50%. The number of rules was determined as below.
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.50))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
The support of a rule is the number of transactions in which all of the items on both sides of the rule end up in the same basket, divided by the total number of transactions.
\[ Support = \frac{\text{Number of Transactions with both A and B}}{\text{Total Number of Transactions}} \]
library("DT")
support_rules <- sort(rules, by = "support", decreasing = TRUE)
support_table <- inspect(support_rules)
datatable(support_table)
Sorting the rules by support, we see that the rule with the highest support is {other vegetables, yogurt} => {whole milk}, which appears 219 times, about 2.2% of all transactions. The rule with the lowest support is {curd, yogurt} => {whole milk}, which occurs only 99 times, roughly 1% of all transactions.
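As a sanity check, the support of that top rule can be recomputed by hand from the transactions; the sketch below uses the arules %ain% ("match all") operator on the trans object created earlier:
# Baskets containing other vegetables, yogurt AND whole milk, as a share of all baskets
full_match <- subset(trans, items %ain% c("other vegetables", "yogurt", "whole milk"))
length(full_match) / length(trans)  # roughly 219 / 9835, i.e. about 0.022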
The confidence of a rule A => B is the number of transactions containing both A and B divided by the number of transactions containing A, the left-hand side of the rule.
\[ Confidence = \frac{\text{Number of Transactions with both A and B}}{\text{Total Number of Transactions with A}} \]
confidence_rules <- sort(rules, by = "confidence", decreasing = TRUE)
confidence_table <- inspect(confidence_rules)
datatable(confidence_table)
Analyzing the results sorted by confidence, the highest confidence belongs to the rule {citrus fruit, root vegetables} => {other vegetables}, at around 58.6%. This means that when a basket contains citrus fruit and root vegetables, there is a 58.6% chance that it also contains other vegetables.
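That confidence figure can also be recomputed directly from the transactions; a short sketch for the rule {citrus fruit, root vegetables} => {other vegetables}:
# Baskets containing both LHS items
lhs_baskets <- subset(trans, items %ain% c("citrus fruit", "root vegetables"))
# ... and those among them that also contain the RHS item
both_baskets <- subset(lhs_baskets, items %in% "other vegetables")
length(both_baskets) / length(lhs_baskets)  # should be close to the reported 0.586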
Another vital measure used in association rules is the lift. It is given by the confidence divided by the expected confidence.
\[ Lift = \frac{\text{Confidence}}{\text{Expected Confidence}} \] The expected confidence is given by the number of transactions containing B divided by the total number of transactions.
\[ Expected \ Confidence = \frac{\text{Number of Transactions with B}}{\text{Total Number of Transactions}} \]
lift_rules <- sort(rules, by = "lift", decreasing = TRUE)
lift_table <- inspect(lift_rules)
datatable(lift_table)
Considering the ranking by lift, the rule with the highest lift was {citrus fruit, root vegetables} => {other vegetables}, with a lift value of 3.03. This means the items on the LHS and RHS are about 3 times more likely to be purchased together than they would be if they were unrelated. The rule with the lowest lift was {other vegetables, whipped/sour cream} => {whole milk}, with a value of 1.98, meaning these items are bought together roughly twice as often as expected.
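The lift of the top rule can be reproduced from the quantities defined above, i.e. the confidence divided by the expected confidence (the overall share of baskets containing the right-hand side); a short sketch:
# Confidence of {citrus fruit, root vegetables} => {other vegetables}
conf <- length(subset(trans, items %ain% c("citrus fruit", "root vegetables", "other vegetables"))) /
  length(subset(trans, items %ain% c("citrus fruit", "root vegetables")))
# Expected confidence: share of all baskets that contain other vegetables (1903 / 9835)
expected_conf <- itemFrequency(trans, type = "relative")["other vegetables"]
conf / expected_conf  # should be close to the reported lift of about 3.03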
Now I will visualize these rules with respect to the support, confidence and lift measures. The plot is produced using the plotly package; hovering the mouse over a plotted point shows the corresponding rule together with its support, confidence and lift values.
plot(rules, engine="plotly")
Suppose the supermarket wants to run a clearance sale and attach discounts to a particular product. It would be wise to take advantage of the association rules to also boost sales of the other products that are most often purchased together with that product.
This can be achieved by setting discounts on items on the right-hand side (rhs) of a rule based on a customer's purchases of the items on the left-hand side (lhs).
For the purpose of this paper, let us check rules for yogurt in our transactions.
yogurt_rules <- apriori(
data = trans,
parameter = list(supp = 0.001, conf = 0.9),
appearance = list(default = "lhs", rhs = "yogurt"),
control = list(verbose = F)
)
yogurt_rules_table <- inspect(yogurt_rules, linebreak = FALSE)
datatable(yogurt_rules_table)
plot(yogurt_rules, method="graph")
It can be seen that 4 rules were created for yogurt, each with a very high confidence of a little over 90%. This makes it easy to identify other products that sell together with yogurt, and the high lift values (above 6) indicate that these combinations are purchased together far more often than would be expected by chance.
As mentioned at the beginning of the paper, the aim was to identify how goods should be displayed in the supermarket so that the purchase of one product can encourage a customer to purchase another. To conclude, we found that whole milk is a frequently purchased item and identified other items that often end up in the same basket with it. For instance, it would be wise to display whole milk close to yogurt and other vegetables, since the rules show these items are bought together about twice as often as would be expected by chance, which can increase purchases of whole milk.
Also, using the rules for a single product, yogurt in our case, the supermarket can make more sales of products such as cream cheese, curd, butter and bread, etc. by applying a discount on yogurt only when a customer purchases a specific amount of the items on the left-hand side.