Market Basket Analysis (MBA) is the term used for detecting patterns in the behaviour of customers. Initially, it was used to determine affinities in grocery shopping by analysing the sets of products that customers bought together. In this paper, I will use it in this particular context. Nowadays, however, MBA goes beyond groceries and can be found in other segments of the economy, wherever customer behaviour is studied. The technique can help a company increase product sales and, consequently, its revenue. Having detected patterns in customer behaviour, the management of a shop can come up with a new product layout that, for example, places items that are often bought together close to each other, making customers even more willing to buy them at once. Another selling strategy would be to offer discounts on products that are often bought together, to make them even more desirable for customers.
To be able to implement these selling strategies, one should first detect the patterns, which in data mining are called association rules. The easiest way to understand how they work is to look at an example: {whole milk, coffee} -> {sugar}. It can be read as follows: if a customer bought whole milk and coffee, he/she is also likely to buy sugar. This rule is fairly intuitive, but when you analyse thousands of transactions with many products you might find surprising patterns that are not obvious at first and would be very hard to detect without a broad dataset. Let’s discuss some terminology behind association rules, to help you understand how to assess their strength, correctness and frequency of occurrence.
According to Wikipedia: “Support is an indication of how frequently the itemset appears in a dataset”.
Let’s assume that the itemset {whole milk, coffee} has a support value equal to 0.1. It would mean that customers buy whole milk and coffee together in 10% of transactions (or, on average, in one out of 10 transactions). Let’s also assume that the itemset {whole milk, coffee, sugar} has support equal to 0.05. Similarly, it would mean that customers buy milk, coffee and sugar together in 5% of transactions (this itemset appears on average in one out of 20 transactions). The higher the support value, the more likely a certain itemset is to appear in a transaction. We will therefore be looking for rules that have relatively high support values.
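To make the definition concrete, here is a minimal base-R sketch that computes support by hand. The ten transactions in `toy_trans` and the helper `itemset_support` are hypothetical and introduced only for illustration; they are not taken from the groceries dataset or from the arules package.

```r
# Hypothetical toy transactions (not the groceries data)
toy_trans <- list(
  c("whole milk", "coffee", "sugar"),
  c("whole milk", "coffee"),
  c("whole milk", "bread"),
  c("coffee", "sugar"),
  c("bread", "butter"),
  c("whole milk", "coffee", "sugar", "bread"),
  c("sugar"),
  c("whole milk"),
  c("bread", "sugar"),
  c("butter")
)

# Support = share of transactions that contain every item of the itemset
itemset_support <- function(itemset, transactions) {
  contains <- vapply(
    transactions,
    function(basket) all(itemset %in% basket),
    logical(1)
  )
  mean(contains)
}

supp_milk_coffee <- itemset_support(c("whole milk", "coffee"), toy_trans)
supp_milk_coffee_sugar <- itemset_support(c("whole milk", "coffee", "sugar"), toy_trans)

supp_milk_coffee        # 3 of 10 baskets
supp_milk_coffee_sugar  # 2 of 10 baskets
```

Adding an item to an itemset can never increase its support, which is why the three-item set appears in no more baskets than the two-item set.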
“Confidence is an indication of how often the rule has been found to be true.”
To clarify this definition let’s go back to our main example. If the itemset {whole milk, coffee} is present in 10% of transactions (support = 0.10) and the broader itemset {whole milk, coffee, sugar} can be found in 5% of transactions (support = 0.05), then to estimate the confidence of the rule {whole milk, coffee} -> {sugar} we need to divide the corresponding support values: 0.05 / 0.10 = 0.5 (or 50%). In other words, in 50% of the transactions in which someone buys milk and coffee, he/she also buys sugar.
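The same division can be written as a short base-R sketch. The helper name `rule_confidence` is an assumption made for this illustration; the two support values are the ones assumed in the example above.

```r
# Confidence of X -> Y = support(X and Y together) / support(X)
rule_confidence <- function(supp_xy, supp_x) {
  supp_xy / supp_x
}

supp_x  <- 0.10  # support of {whole milk, coffee}
supp_xy <- 0.05  # support of {whole milk, coffee, sugar}

conf_rule <- rule_confidence(supp_xy, supp_x)
conf_rule  # 0.5
```

Note that confidence is not symmetric: the rule {sugar} -> {whole milk, coffee} would divide by the support of {sugar} instead and generally gives a different value.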
Lift is not as easy to define as the previous two measures. The most understandable explanation that I came across is on Quora, where the author encourages assessing itemsets in terms of probabilities. The formula for lift is: support(X, Y) / (support(X) * support(Y))
In our example, it would be: support(whole milk, coffee, sugar) / (support(whole milk, coffee) * support(sugar))
We know that support(whole milk, coffee, sugar) = 0.05 and support(whole milk, coffee) = 0.1. Now let’s assume that support(sugar) = 0.3 (sugar is present in 30% of transactions).
Our lift value would be 0.05 / (0.1 * 0.3) = 0.05 / 0.03 = 1.67.
We can interpret it as follows: since the lift is higher than 1, the probability of buying sugar depends on buying the antecedent ({whole milk, coffee}), and the relation is positive (the customer is more likely to buy sugar). If the lift were equal to 1, we would assume that the two events {whole milk, coffee} and {sugar} are independent of each other. If the lift were lower than 1, buying milk and coffee would make the consequent event (buying sugar) less probable than if the two were independent. So, similarly to confidence and support, we will be looking for the highest possible lift values in further analysis.
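The calculation and its interpretation can be sketched in base R as follows. The helpers `rule_lift` and `interpret_lift` are hypothetical names used only for this illustration, and the three support values are the assumed ones from the example, not results from the groceries data.

```r
# Lift of X -> Y = support(X, Y) / (support(X) * support(Y))
rule_lift <- function(supp_xy, supp_x, supp_y) {
  supp_xy / (supp_x * supp_y)
}

# Verbal reading of a lift value (tolerance guards against rounding noise)
interpret_lift <- function(lift, tol = 1e-8) {
  if (abs(lift - 1) < tol) {
    "independent"
  } else if (lift > 1) {
    "positive association"
  } else {
    "negative association"
  }
}

lift_sugar <- rule_lift(supp_xy = 0.05, supp_x = 0.10, supp_y = 0.30)
round(lift_sugar, 2)        # 1.67
interpret_lift(lift_sugar)  # "positive association"

# Equivalent view: lift = confidence / support of the consequent
lift_from_conf <- (0.05 / 0.10) / 0.30
```

The second form shows that lift compares the confidence of the rule (50%) with the overall share of transactions containing sugar (30%).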
The data analysed in this paper can be found on Kaggle under this link: https://www.kaggle.com/irfanasrullah/groceries. The dataset contains 9835 transactions made by customers shopping for groceries, with 169 unique items overall. Let’s explore this dataset using the arules library.
library(arules)
trans <- read.transactions(
"groceries.csv",
format = "basket",
sep = ",",
skip = 0
)
trans
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
inspect(head(trans))
## items
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread}
## [2] {coffee,
## tropical fruit,
## yogurt}
## [3] {whole milk}
## [4] {cream cheese,
## meat spreads,
## pip fruit,
## yogurt}
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk}
## [6] {abrasive cleaner,
## butter,
## rice,
## whole milk,
## yogurt}
itemFrequencyPlot(trans, topN = 30, support = 0.05, col = '#99CCFF', main = 'Relative item frequency', ylab = 'frequency')
itemFrequencyPlot(trans, topN = 30, type = 'absolute', col = '#339966', main = 'Absolute item frequency', ylab = 'frequency')
In the graphs above we can see the items that occur most frequently across all transactions. Whole milk is the most popular product in this sample: it occurs in more than 25% of transactions (more than 2500 transactions in absolute terms). Soda is present in around 17% of transactions. I could name a few more product frequencies, but without further delay, let’s proceed to retrieving and analysing the rules.
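The link between the relative and the absolute plot is a simple multiplication by the number of transactions. As a small sketch, the relative frequencies below are copied from the coverage column of the rule output later in this paper (0.25551601 for whole milk, 0.17437722 for soda); the variable names are chosen only for this illustration.

```r
n_trans <- 9835  # number of transactions in the dataset

# Relative item frequencies as reported in the rule output of this paper
rel_freq <- c("whole milk" = 0.25551601, "soda" = 0.17437722)

# Absolute frequency = relative frequency * number of transactions
abs_freq <- round(rel_freq * n_trans)
abs_freq
```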
For the purpose of this paper, I will only analyse rules for two products of my own choice: soda and sausage.
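To find the strongest connections (rules) between the itemsets that will increase the chance of buying soda or sausage, I will use the apriori algorithm. This algorithm uses a bottom-up approach and consists of two main steps:
• Join: candidate itemsets of size k + 1 are generated from the frequent itemsets of size k.
• Prune: candidates whose support falls below the minimum threshold are discarded.
This process is repeated iteratively, with itemsets growing by one item per pass, to discover the frequent itemsets.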
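The level-wise idea can be illustrated with a minimal base-R sketch. This is not the implementation used by arules: the toy transactions, the helper names and the simplified candidate generation (all combinations of the items that are still part of a frequent itemset, rather than the classic prefix-based join) are assumptions made for readability.

```r
# Hypothetical toy transactions (not the groceries data)
toy_trans <- list(
  c("whole milk", "coffee", "sugar"),
  c("whole milk", "coffee"),
  c("whole milk", "bread"),
  c("coffee", "sugar"),
  c("bread", "butter"),
  c("whole milk", "coffee", "sugar", "bread"),
  c("sugar"),
  c("whole milk"),
  c("bread", "sugar"),
  c("butter")
)

# Share of transactions that contain every item of the itemset
support_of <- function(itemset, transactions) {
  mean(vapply(transactions, function(basket) all(itemset %in% basket), logical(1)))
}

frequent_itemsets <- function(transactions, min_supp) {
  is_frequent <- function(s) support_of(s, transactions) >= min_supp
  items <- sort(unique(unlist(transactions)))
  # Level 1: frequent single items
  current <- Filter(is_frequent, as.list(items))
  result <- current
  k <- 1
  while (length(current) > 1) {
    k <- k + 1
    # Simplified join: size-k candidates from items that are still frequent
    pool <- sort(unique(unlist(current)))
    if (length(pool) < k) break
    candidates <- combn(pool, k, simplify = FALSE)
    # Prune: keep only the candidates that reach the minimum support
    current <- Filter(is_frequent, candidates)
    result <- c(result, current)
  }
  result
}

freq <- frequent_itemsets(toy_trans, min_supp = 0.25)
for (s in freq) cat("{", paste(s, collapse = ", "), "}\n")
```

The sketch relies on the property that makes Apriori efficient: an itemset can only be frequent if all of its subsets are frequent, so items that drop out at one level never need to be considered again.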
rules_soda <-
apriori(
data = trans,
parameter = list(supp = 0.02, conf = 0.2),
appearance = list(default = "lhs", rhs = "soda")
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.02 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 196
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_soda_sup <- sort(rules_soda, by = 'support', decreasing = TRUE)
inspect(head(rules_soda_sup, n = 10))
## lhs rhs support confidence coverage lift count
## [1] {rolls/buns} => {soda} 0.03833249 0.2084024 0.18393493 1.195124 377
## [2] {bottled water} => {soda} 0.02897814 0.2621895 0.11052364 1.503577 285
## [3] {shopping bags} => {soda} 0.02460600 0.2497420 0.09852567 1.432194 242
## [4] {sausage} => {soda} 0.02430097 0.2586580 0.09395018 1.483324 239
## [5] {pastry} => {soda} 0.02104728 0.2365714 0.08896797 1.356665 207
rules_sausage <-
apriori(
data = trans,
parameter = list(supp = 0.01, conf = 0.1),
appearance = list(default = "lhs", rhs = "sausage")
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [16 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_sausage_sup <- sort(rules_sausage, by = 'support', decreasing = TRUE)
inspect(head(rules_sausage_sup, n = 10))
## lhs rhs support confidence coverage lift count
## [1] {rolls/buns} => {sausage} 0.03060498 0.1663903 0.18393493 1.771048 301
## [2] {whole milk} => {sausage} 0.02989324 0.1169916 0.25551601 1.245252 294
## [3] {other vegetables} => {sausage} 0.02694459 0.1392538 0.19349263 1.482209 265
## [4] {soda} => {sausage} 0.02430097 0.1393586 0.17437722 1.483324 239
## [5] {yogurt} => {sausage} 0.01962379 0.1406706 0.13950178 1.497289 193
## [6] {shopping bags} => {sausage} 0.01565836 0.1589267 0.09852567 1.691606 154
## [7] {root vegetables} => {sausage} 0.01494662 0.1371269 0.10899847 1.459570 147
## [8] {tropical fruit} => {sausage} 0.01392984 0.1327519 0.10493137 1.413004 137
## [9] {pastry} => {sausage} 0.01250635 0.1405714 0.08896797 1.496234 123
## [10] {bottled water} => {sausage} 0.01199797 0.1085557 0.11052364 1.155460 118
Using the apriori function, I established the strongest association rules for both sausage and soda. Since soda appears more frequently in transactions than sausage (which can be seen in the item frequency plots), I set the minimum support and confidence levels higher for soda than for sausage (namely 2% support and 20% confidence for soda, and 1% support and 10% confidence for sausage). The results are displayed with the inspect function. But to see them clearly, let’s visualize the rules with various graphs.
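The columns of the inspect output are tied together by the definitions given earlier, which can be checked with a few lines of base R. The numbers below are copied from the {rolls/buns} => {soda} row and from the apriori log above; the variable names are chosen only for this sketch, and the use of `floor` for the minimum support count is an assumption that happens to reproduce the values printed in the log.

```r
n_trans <- 9835

# Values from the {rolls/buns} => {soda} row
supp_rule <- 0.03833249  # support of {rolls/buns, soda}
coverage  <- 0.18393493  # support of the antecedent {rolls/buns}
supp_soda <- 0.17437722  # support of {soda} (coverage of {soda} => {sausage})

conf_check  <- supp_rule / coverage       # confidence = support / coverage
lift_check  <- conf_check / supp_soda     # lift = confidence / support(rhs)
count_check <- round(supp_rule * n_trans) # count = support * transactions

round(c(confidence = conf_check, lift = lift_check, count = count_check), 4)

# Minimum support thresholds expressed as transaction counts
min_count_soda    <- floor(0.02 * n_trans)  # 196, as in the soda log
min_count_sausage <- floor(0.01 * n_trans)  # 98, as in the sausage log
```

The coverage column is simply the support of the left-hand side, so a rule with high coverage but modest confidence (such as {rolls/buns} => {soda}) can still reach the highest support in the list.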
Firstly, I will analyse the plots for the rules established for sausages.
library(arulesViz)
plot(rules_sausage, method="graph")
plot(rules_sausage, method = 'paracoord', control=list(reorder=TRUE))
The apriori algorithm found 16 rules for sausage. The graphs clearly indicate that one rule stands out from the whole set: {rolls/buns} -> {sausage}, which has both high support and high lift. We can interpret its lift value (circa 1.77) as follows: a customer who already has rolls/buns in the basket is about 1.77 times more likely to buy sausage than a customer taken at random from all transactions.
plot(rules_soda, method="graph")
plot(rules_soda, method = 'paracoord', control=list(reorder=TRUE))
In the case of soda, the apriori algorithm found 5 rules that fulfil the desired minimum levels of support and confidence. Here again, buying rolls/buns is associated with buying the consequent item, soda. This rule has the largest support, but its lift value is not very high (circa 1.2). The highest lift value (around 1.5) is obtained with bottled water as the antecedent of the rule. So a person who buys bottled water is about 1.5 times more likely to buy soda than the average customer.
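Strictly speaking, lift compares buyers of the antecedent with all customers, not with the customers who did not buy it. The base-R sketch below, using the support values reported above for {bottled water} => {soda}, computes both comparisons; the variable names are introduced only for this illustration.

```r
# Values from the rule output above
supp_water_soda <- 0.02897814  # support of {bottled water, soda}
supp_water      <- 0.11052364  # support of {bottled water}
supp_soda       <- 0.17437722  # support of {soda}

# Share of soda buyers among water buyers (the confidence of the rule)
p_soda_given_water <- supp_water_soda / supp_water

# Share of soda buyers among customers who did NOT buy water
p_soda_given_no_water <- (supp_soda - supp_water_soda) / (1 - supp_water)

# Lift: water buyers compared with all customers
lift_water <- p_soda_given_water / supp_soda

# Alternative ratio: water buyers compared with non-buyers of water
ratio_vs_non_buyers <- p_soda_given_water / p_soda_given_no_water

round(c(lift = lift_water, ratio_vs_non_buyers = ratio_vs_non_buyers), 3)
```

When the lift is above 1, the ratio against non-buyers is larger than the lift, because the overall share of soda buyers already includes the water buyers themselves.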
All things considered, the rules presented above are a good example of how useful market basket analysis can be for retailers. Using the rules and a deep understanding of the obtained results (support and lift values in particular), a retailer would be able to improve the placement of products and come up with many profitable up-sell or cross-sell strategies.
Bibliography:
• https://en.wikipedia.org/wiki/Association_rule_learning
• https://www.quora.com/How-can-I-interpret-the-formula-of-lift-ratio-in-association-rule
• https://www.softwaretestinghelp.com/apriori-algorithm/