In the past, stores recorded their sales with pen and paper, a rudimentary method that is still used in some countries. Today, software gives us detailed reports of how many goods a supermarket sold, which goods, by what payment method, and to whom. But what if a supermarket wants to know whether the sale of one product is directly related to the sale of another? Intuitive examples are coffee with sugar, notebooks with pens, or bread with marmalade.
This information opens new possibilities for stores, such as targeted discounts or identifying the least purchased products so that their turnover can be boosted with special offers. The scope of this tool is quite broad.
This paper aims to show how useful an unsupervised learning method called Association Rules is for Market Basket Analysis in a supermarket or any other retail or wholesale store.
Association Rules is a method for finding relationships or dependencies between variables in a dataset. These relationships provide useful information for decision making in any business.
For our analysis we need the following packages.
library("arules")
library("arulesViz")
library("plotly")
The dataset consists of 20 columns and 7501 rows. Each row represents one customer transaction and the goods bought in it.
nrow(shop)
## [1] 7501
ncol(shop)
## [1] 20
For instance, the first row shows that in one transaction customer number one bought shrimp, almonds, avocado, vegetables mix, green grapes, whole weat flour (a typo in the dataset), yams, cottage cheese, energy drink, tomato juice, low fat yogurt, green tea, honey, salad, mineral water, salmon, antioxydant juice, frozen smoothie, spinach and olive oil. The second row shows the goods bought in one transaction by the second customer: burgers, meatballs and eggs. The columns simply hold the different goods that a customer purchased in a single transaction. Note that the call below passes header = TRUE, so this first row is treated as a header line rather than a transaction, which appears to be why 7500 transactions are reported instead of 7501.
trans<-read.transactions("/Users/ayaxdiaz/Desktop/UL/Market Basket Analysis/MBA.csv", format = "basket", sep=",", header = TRUE)
## Warning in asMethod(object): removing duplicated items in transactions
trans
## transactions in sparse format with
## 7500 transactions (rows) and
## 119 items (columns)
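The basket format and the duplicate-item warning can be illustrated with a minimal base-R sketch. The four lines below are made-up toy data, not rows from the supermarket file, and the code only mimics the idea of basket parsing; it is not how read.transactions is implemented.

```r
# Toy basket-format lines (hypothetical data, one transaction per line)
lines <- c(
  "shrimp,almonds,avocado",
  "burgers,meatballs,eggs",
  "chutney",
  "turkey,avocado,turkey"   # "turkey" is listed twice in this basket
)

# Split each line on commas: one character vector of items per transaction
baskets <- strsplit(lines, ",", fixed = TRUE)

# Keep each item at most once per basket
baskets <- lapply(baskets, unique)

length(baskets)      # number of transactions when every line is data
baskets[[4]]         # the duplicated "turkey" has been dropped

# Treating the first line as a header leaves one transaction fewer
without_header <- baskets[-1]
length(without_header)
```

In the toy data the fourth basket lists turkey twice, which is the kind of row the warning about removing duplicated items refers to.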
summary(trans)
## transactions as itemMatrix in sparse format with
## 7500 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03287171
##
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate
##          1787          1348          1306          1282          1229
##       (Other)
##         22386
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4
##   18   19
##    1    2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.912 5.000 19.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
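The density and the mean basket size in the summary can be recomputed from the printed counts. The following is a small check in base R that uses only numbers copied from the output above; the variable names are ours.

```r
# Figures copied from the summary(trans) output
n_transactions <- 7500
n_items <- 119

# Counts of the five most frequent items plus the "(Other)" bucket
item_counts <- c(1787, 1348, 1306, 1282, 1229, 22386)

# Total number of item purchases, i.e. filled cells of the sparse matrix
total_purchases <- sum(item_counts)

# Density: filled cells divided by all cells (transactions x items)
density <- total_purchases / (n_transactions * n_items)

# Mean basket size: item purchases per transaction
mean_basket_size <- total_purchases / n_transactions

round(density, 8)
round(mean_basket_size, 3)
```

Both values agree with the summary: a density of about 0.0329 and a mean of about 3.9 items per transaction.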
itemFrequencyPlot(trans, topN=10, type="absolute", main="Items Frequency")
Clearly this supermarket sold a lot of mineral water, eggs, spaghetti, french fries and chocolate, to mention only the most frequent items. It would therefore not be surprising to find other goods related to them.
We can also see that the least purchased products were water spray, napkins, cream, bramble and tea, among others, so it would not be surprising to find few direct relations between these products and others.
tail(sort(itemFrequency(trans, type="absolute"), decreasing=TRUE), n=10)
##         ketchup         oatmeal chocolate bread         chutney   mashed potato
##              33              33              32              31              31
##             tea         bramble           cream         napkins     water spray
##              29              14               7               5               3
As mentioned before, Association Rules is a method for finding relationships or dependencies between variables in a dataset, which in turn provide useful information for business decisions.
Market basket analysis is one of the most intuitive applications of association rules. It aims to analyze customer buying patterns by finding associations between the items that customers put into their baskets.
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.40))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 75
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7500 transaction(s)] done [0.00s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [17 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
We can define support as the number of transactions in which the two items are purchased together, expressed as a fraction of the total number of transactions. The higher the support, the more frequently the item set occurs.
This can mathematically be expressed as the following:
\(\text{Support} = \frac{\text{Number of transactions with both A and B}}{\text{Total number of transactions}}\)
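The definition can be tried out on a handful of baskets. This is a minimal base-R sketch: the toy baskets and the helper name support() are our own illustration, not part of the arules package, while the last two lines reuse counts reported for the supermarket data (307 joint purchases of ground beef and mineral water, 7500 transactions, minimum support 0.01).

```r
# Hypothetical toy baskets, used only to illustrate the definition
baskets <- list(
  c("ground beef", "mineral water", "spaghetti"),
  c("ground beef", "mineral water"),
  c("mineral water", "eggs"),
  c("ground beef", "eggs"),
  c("spaghetti")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Two of the five toy baskets contain both items
toy_support <- support(c("ground beef", "mineral water"), baskets)
toy_support

# Same arithmetic with the supermarket counts
rule_support <- 307 / 7500
rule_support

# A minimum support of 0.01 corresponds to this many transactions
min_count <- 0.01 * 7500
min_count
```

The second value matches the support of about 0.0409 reported for {ground beef} => {mineral water}, and the third matches the absolute minimum support count of 75 in the apriori output.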
library("DT")
support_rules <- sort(rules, by = "support", decreasing = TRUE)
support_table <- inspect(support_rules)
## lhs rhs support confidence
## [1] {ground beef} => {mineral water} 0.04093333 0.4165536
## [2] {olive oil} => {mineral water} 0.02746667 0.4178499
## [3] {soup} => {mineral water} 0.02306667 0.4564644
## [4] {ground beef,spaghetti} => {mineral water} 0.01706667 0.4353741
## [5] {ground beef,mineral water} => {spaghetti} 0.01706667 0.4169381
## [6] {chocolate,spaghetti} => {mineral water} 0.01586667 0.4047619
## [7] {milk,spaghetti} => {mineral water} 0.01573333 0.4436090
## [8] {chocolate,milk} => {mineral water} 0.01400000 0.4356846
## [9] {chocolate,eggs} => {mineral water} 0.01346667 0.4056225
## [10] {eggs,milk} => {mineral water} 0.01306667 0.4242424
## [11] {frozen vegetables,spaghetti} => {mineral water} 0.01200000 0.4306220
## [12] {pancakes,spaghetti} => {mineral water} 0.01146667 0.4550265
## [13] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [14] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [15] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [16] {olive oil,spaghetti} => {mineral water} 0.01026667 0.4476744
## [17] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## coverage lift count
## [1] 0.09826667 1.748266 307
## [2] 0.06573333 1.753707 206
## [3] 0.05053333 1.915771 173
## [4] 0.03920000 1.827256 128
## [5] 0.04093333 2.394361 128
## [6] 0.03920000 1.698777 119
## [7] 0.03546667 1.861817 118
## [8] 0.03213333 1.828559 105
## [9] 0.03320000 1.702389 101
## [10] 0.03080000 1.780536 98
## [11] 0.02786667 1.807311 90
## [12] 0.02520000 1.909736 86
## [13] 0.02360000 1.968075 83
## [14] 0.02200000 2.111207 83
## [15] 0.02306667 1.989319 82
## [16] 0.02293333 1.878880 77
## [17] 0.02000000 2.126469 76
datatable(support_table)
Sorting the rules by support, we see that ground beef together with mineral water is the most frequent combination (307 transactions). The least frequent of these rules is eggs and ground beef purchased along with mineral water, which appears in 76 transactions.
We can define confidence as the probability that a transaction that contains the items on the left-hand side of the rule also contains the item on the right-hand side. The higher the confidence, the greater the likelihood that the right-hand-side item is purchased once the left-hand-side items are in the basket.
This can mathematically be expressed as the following:
\(\text{Confidence} = \frac{\text{Number of transactions with both A and B}}{\text{Total number of transactions with A}}\)
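As a worked check, the confidence of {ground beef} => {mineral water} can be recomputed from the figures in the rules table. The sketch below is plain base R with our own variable names; it relies on the coverage column being the support of the left-hand side.

```r
# Values copied from the rule {ground beef} => {mineral water}
n_transactions <- 7500
count_both <- 307          # transactions with ground beef and mineral water
coverage <- 0.09826667     # support of the left-hand side {ground beef}

# Transactions containing ground beef; rounded because the printed
# coverage is itself a rounded number
count_lhs <- round(coverage * n_transactions)
count_lhs

# Confidence: joint count divided by left-hand-side count
confidence <- count_both / count_lhs
round(confidence, 7)
```

The result agrees with the confidence of about 0.4166 printed for this rule.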
confidence_rules <- sort(rules, by = "confidence", decreasing = TRUE)
confidence_table <- inspect(confidence_rules)
## lhs rhs support confidence
## [1] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## [2] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [3] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [4] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [5] {soup} => {mineral water} 0.02306667 0.4564644
## [6] {pancakes,spaghetti} => {mineral water} 0.01146667 0.4550265
## [7] {olive oil,spaghetti} => {mineral water} 0.01026667 0.4476744
## [8] {milk,spaghetti} => {mineral water} 0.01573333 0.4436090
## [9] {chocolate,milk} => {mineral water} 0.01400000 0.4356846
## [10] {ground beef,spaghetti} => {mineral water} 0.01706667 0.4353741
## [11] {frozen vegetables,spaghetti} => {mineral water} 0.01200000 0.4306220
## [12] {eggs,milk} => {mineral water} 0.01306667 0.4242424
## [13] {olive oil} => {mineral water} 0.02746667 0.4178499
## [14] {ground beef,mineral water} => {spaghetti} 0.01706667 0.4169381
## [15] {ground beef} => {mineral water} 0.04093333 0.4165536
## [16] {chocolate,eggs} => {mineral water} 0.01346667 0.4056225
## [17] {chocolate,spaghetti} => {mineral water} 0.01586667 0.4047619
## coverage lift count
## [1] 0.02000000 2.126469 76
## [2] 0.02200000 2.111207 83
## [3] 0.02306667 1.989319 82
## [4] 0.02360000 1.968075 83
## [5] 0.05053333 1.915771 173
## [6] 0.02520000 1.909736 86
## [7] 0.02293333 1.878880 77
## [8] 0.03546667 1.861817 118
## [9] 0.03213333 1.828559 105
## [10] 0.03920000 1.827256 128
## [11] 0.02786667 1.807311 90
## [12] 0.03080000 1.780536 98
## [13] 0.06573333 1.753707 206
## [14] 0.04093333 2.394361 128
## [15] 0.09826667 1.748266 307
## [16] 0.03320000 1.702389 101
## [17] 0.03920000 1.698777 119
datatable(confidence_table)
Sorting by confidence, we see that when a customer buys eggs and ground beef, he or she is most likely to also buy mineral water: this rule has the highest confidence, approximately 0.51. In other words, about 51% of the transactions that contain eggs and ground beef also contain mineral water.
We can define lift as the probability of all the items in the rule occurring together divided by the product of the probabilities of the left-hand side and the right-hand side occurring, as if there were no association between them. Lift summarizes the strength of association between the products on the two sides of the rule: a lift above 1 means they are bought together more often than expected under independence, and the larger the lift, the stronger the link.
This can mathematically be expressed as the following:
\(\text{Lift} = \frac{\text{Confidence}}{\text{Expected Confidence}}\)
where Expected Confidence is defined as:
\(\text{Expected Confidence} = \frac{\text{Number of transactions with B}}{\text{Total number of transactions}}\)
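The two formulas can be checked against the rule {ground beef} => {mineral water}. This base-R sketch uses only figures printed in the outputs (1787 mineral water purchases from the summary, and the rule's support, confidence and coverage); the variable names are ours.

```r
n_transactions <- 7500
confidence <- 0.4165536        # rule {ground beef} => {mineral water}
count_rhs <- 1787              # transactions containing mineral water

# Expected confidence: how often B appears regardless of A
expected_confidence <- count_rhs / n_transactions

# Lift as confidence over expected confidence
lift <- confidence / expected_confidence
round(lift, 6)

# Equivalent form: joint support over the product of the two supports
support_both <- 0.04093333
support_lhs <- 0.09826667
lift_alt <- support_both / (support_lhs * expected_confidence)
round(lift_alt, 4)
```

Both forms give a lift of about 1.748, the value reported for this rule, which shows that the ratio of confidences and the ratio of supports are two ways of writing the same quantity.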
lift_rules <- sort(rules, by = "lift", decreasing = TRUE)
lift_table <- inspect(lift_rules)
## lhs rhs support confidence
## [1] {ground beef,mineral water} => {spaghetti} 0.01706667 0.4169381
## [2] {eggs,ground beef} => {mineral water} 0.01013333 0.5066667
## [3] {ground beef,milk} => {mineral water} 0.01106667 0.5030303
## [4] {chocolate,ground beef} => {mineral water} 0.01093333 0.4739884
## [5] {frozen vegetables,milk} => {mineral water} 0.01106667 0.4689266
## [6] {soup} => {mineral water} 0.02306667 0.4564644
## [7] {pancakes,spaghetti} => {mineral water} 0.01146667 0.4550265
## [8] {olive oil,spaghetti} => {mineral water} 0.01026667 0.4476744
## [9] {milk,spaghetti} => {mineral water} 0.01573333 0.4436090
## [10] {chocolate,milk} => {mineral water} 0.01400000 0.4356846
## [11] {ground beef,spaghetti} => {mineral water} 0.01706667 0.4353741
## [12] {frozen vegetables,spaghetti} => {mineral water} 0.01200000 0.4306220
## [13] {eggs,milk} => {mineral water} 0.01306667 0.4242424
## [14] {olive oil} => {mineral water} 0.02746667 0.4178499
## [15] {ground beef} => {mineral water} 0.04093333 0.4165536
## [16] {chocolate,eggs} => {mineral water} 0.01346667 0.4056225
## [17] {chocolate,spaghetti} => {mineral water} 0.01586667 0.4047619
## coverage lift count
## [1] 0.04093333 2.394361 128
## [2] 0.02000000 2.126469 76
## [3] 0.02200000 2.111207 83
## [4] 0.02306667 1.989319 82
## [5] 0.02360000 1.968075 83
## [6] 0.05053333 1.915771 173
## [7] 0.02520000 1.909736 86
## [8] 0.02293333 1.878880 77
## [9] 0.03546667 1.861817 118
## [10] 0.03213333 1.828559 105
## [11] 0.03920000 1.827256 128
## [12] 0.02786667 1.807311 90
## [13] 0.03080000 1.780536 98
## [14] 0.06573333 1.753707 206
## [15] 0.09826667 1.748266 307
## [16] 0.03320000 1.702389 101
## [17] 0.03920000 1.698777 119
datatable(lift_table)
From these results we can see how strongly the items in each rule are associated. In our case, {ground beef, mineral water} => {spaghetti} has a lift of about 2.39, which suggests that the items on the left-hand side and the right-hand side are purchased together about 2.4 times more often than would be expected if they were unrelated.
Next we will visualize these rules (support, confidence and lift) together.
plot(rules, engine="plotly")
Now we will analyze mineral water and its relation to other products. With this information we could design a business strategy to boost sales and help the supermarket get the best out of it.
water_rules <- apriori(
  data = trans,
  parameter = list(supp = 0.001, conf = 0.9),
  appearance = list(default = "lhs", rhs = "mineral water"),
  control = list(verbose = FALSE)
)
water_rules_table <- inspect(water_rules, linebreak = FALSE)
## lhs rhs
## [1] {red wine,soup} => {mineral water}
## [2] {ground beef,light cream,olive oil} => {mineral water}
## [3] {ground beef,pancakes,whole wheat rice} => {mineral water}
## [4] {cake,olive oil,shrimp} => {mineral water}
## [5] {frozen vegetables,milk,spaghetti,turkey} => {mineral water}
## [6] {chocolate,frozen vegetables,olive oil,shrimp} => {mineral water}
## support confidence coverage lift count
## [1] 0.001866667 0.9333333 0.002000000 3.917180 14
## [2] 0.001200000 1.0000000 0.001200000 4.196978 9
## [3] 0.001333333 0.9090909 0.001466667 3.815435 10
## [4] 0.001200000 1.0000000 0.001200000 4.196978 9
## [5] 0.001200000 0.9000000 0.001333333 3.777280 9
## [6] 0.001200000 0.9000000 0.001333333 3.777280 9
datatable(water_rules_table)
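It helps to translate these six rules back into raw counts before acting on them. The base-R sketch below uses only the count and coverage columns printed above plus the 1787 mineral water purchases from the summary; the variable names are ours.

```r
n_transactions <- 7500

# Columns copied from the water_rules output, rules [1] to [6]
count    <- c(14, 9, 10, 9, 9, 9)
coverage <- c(0.002, 0.0012, 0.001466667, 0.0012, 0.001333333, 0.001333333)

# Baskets containing the left-hand side; rounded because the printed
# coverage values are rounded
lhs_count <- round(coverage * n_transactions)

# Confidence and lift recomputed from the counts
confidence <- count / lhs_count
lift <- confidence / (1787 / n_transactions)

data.frame(
  lhs_count,
  count,
  confidence = round(confidence, 7),
  lift = round(lift, 6)
)
```

The recomputed confidence and lift agree with the table, and the lhs_count column shows that each rule rests on between 9 and 15 baskets, so one or two extra transactions could move these confidences noticeably.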
In this case study we have seen how a supermarket can find out which goods are purchased most often and how they relate to others.
For example, the analysis shows that this supermarket could offer a slight discount on mineral water to a customer who hesitates to buy it after buying ground beef, light cream and olive oil; the same applies to cake, olive oil and shrimp. Both rules have a confidence of 1.0 in this data, although each rests on only 9 transactions. We should not discard the other relations with mineral water either, such as chocolate, frozen vegetables, olive oil and shrimp, or frozen vegetables, milk, spaghetti and turkey.
This information could also be used to boost customer visits. Suppose a customer did not buy these articles together last time. We could run an individualized marketing campaign telling the customer that buying this bundle next time earns a discount valid for the upcoming 3 days. That would encourage the customer to return to the store to claim the discount, or at least to visit and make some other purchase, which the store could not have prompted without this information.
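One possible way to pick the customers for such a campaign is to look for baskets that contain the left-hand side of a rule but not its right-hand side. The sketch below is purely illustrative base R: the customer ids and baskets are hypothetical, since the dataset used here carries no customer identifiers.

```r
# Hypothetical last baskets, keyed by made-up customer ids
last_baskets <- list(
  c101 = c("ground beef", "light cream", "olive oil"),
  c102 = c("ground beef", "light cream", "olive oil", "mineral water"),
  c103 = c("cake", "eggs")
)

# One of the mineral water rules found above
lhs <- c("ground beef", "light cream", "olive oil")
rhs <- "mineral water"

# TRUE when a basket holds every lhs item but lacks the rhs item
is_candidate <- vapply(
  last_baskets,
  function(b) all(lhs %in% b) && !(rhs %in% b),
  logical(1)
)

# Customers who could receive the bundle offer
candidates <- names(last_baskets)[is_candidate]
candidates
```

Here only customer c101 qualifies: c102 already bought the full bundle and c103 bought none of the left-hand-side items.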
In other words, association rules, and specifically Market Basket Analysis, are a powerful tool for learning how to bundle goods or even services in different industries in order to boost sales and rotate inventory, among other uses. They can also help us profile individual customers by their habits and offer them special discounts.