Association rule mining is an unsupervised learning technique that aims to discover and describe regularities between items in transaction data. One of its applications is predicting which product a customer is likely to choose next, given a product they have already bought. This knowledge of customer behavior allows for better-targeted discounts and product placement.
Suppose that our manager asked us where to put new yogurt and tropical fruit in the store. Our task in this article is therefore to find rules for customers’ baskets that contain yogurt or tropical fruit. Once we have defined the goal, we can proceed with the data analysis.
For the association analysis we will use the packages ‘arules’ and ‘arulesViz’. The analysis is based on the Groceries dataset, which comes with the arules package.
library(arules)
## Warning: package 'arules' was built under R version 3.5.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 3.5.2
## Loading required package: grid
data(Groceries)
data <- Groceries
#size(data)
length(data)
## [1] 9835
Rules are created using the Apriori algorithm. There are three main indicators used to assess the quality of rules:
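Support - how often the itemset (all items of the rule taken together) appears in the dataset, as a share of all transactions
Confidence - how often the rule holds: the share of transactions containing the left-hand side that also contain the right-hand side; 1 means the rule holds in 100% of those transactions
Lift - confidence divided by the support of the right-hand side; Lift>1 -> products are positively correlated, Lift<1 -> products are negatively correlated, Lift=1 -> products are independent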
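To make the three definitions concrete, here is a minimal sketch in base R that computes them by hand for a single rule. The five baskets are made-up toy data (not taken from Groceries), and the helper `support()` is a hypothetical name introduced only for this illustration.

```r
# Toy baskets (hypothetical data, not taken from Groceries)
baskets <- list(
  c("milk", "yogurt"),
  c("milk", "bread"),
  c("milk", "yogurt", "bread"),
  c("yogurt"),
  c("bread")
)

# Share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule under inspection: {yogurt} => {milk}
lhs_items <- "yogurt"
rhs_items <- "milk"

supp_rule <- support(c(lhs_items, rhs_items))  # share of baskets with both sides
conf_rule <- supp_rule / support(lhs_items)    # share of yogurt baskets that also hold milk
lift_rule <- conf_rule / support(rhs_items)    # confidence relative to milk's overall share

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

In this toy example the lift is slightly above 1, so yogurt and milk appear together a little more often than independence would suggest.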
Here we have to experiment a bit to find support and confidence levels that produce some results while limiting the number of rules. For this dataset they will be 0.01 and 0.5, respectively.
rules<-apriori(data, parameter=list(supp=0.01, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
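The search for workable thresholds mentioned above can be sketched as a small grid: run `apriori` for several support/confidence pairs and count the rules each pair yields. The candidate values and the names `grid` and `n_rules` are assumptions chosen for illustration; only the pair 0.01 / 0.5 comes from the analysis itself.

```r
library(arules)
data(Groceries)

# Candidate thresholds (arbitrary illustrative values)
grid <- expand.grid(supp = c(0.005, 0.01, 0.02), conf = c(0.4, 0.5, 0.6))

# Count the rules produced by each support/confidence pair
grid$n_rules <- mapply(function(s, cf) {
  length(apriori(Groceries,
                 parameter = list(supp = s, conf = cf),
                 control = list(verbose = FALSE)))
}, grid$supp, grid$conf)

print(grid)
```

Raising either threshold can only shrink the rule set, so a table like this makes it easy to pick a pair that leaves a manageable number of rules to inspect.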
itemFrequencyPlot(data, topN=20, type="relative", main="Item Frequency")
The first graph shows the relative frequency of the 20 most common items in the transactions. The most frequent are whole milk, other vegetables and rolls/buns.
rules.by.conf<-sort(rules, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))
## lhs rhs support confidence lift count
## [1] {citrus fruit,
## root vegetables} => {other vegetables} 0.01037112 0.5862069 3.029608 102
## [2] {tropical fruit,
## root vegetables} => {other vegetables} 0.01230300 0.5845411 3.020999 121
## [3] {curd,
## yogurt} => {whole milk} 0.01006609 0.5823529 2.279125 99
## [4] {other vegetables,
## butter} => {whole milk} 0.01148958 0.5736041 2.244885 113
## [5] {tropical fruit,
## root vegetables} => {whole milk} 0.01199797 0.5700483 2.230969 118
## [6] {root vegetables,
## yogurt} => {whole milk} 0.01453991 0.5629921 2.203354 143
rules.by.lift<-sort(rules, by="lift", decreasing=TRUE)
inspect(head(rules.by.lift))
## lhs rhs support confidence lift count
## [1] {citrus fruit,
## root vegetables} => {other vegetables} 0.01037112 0.5862069 3.029608 102
## [2] {tropical fruit,
## root vegetables} => {other vegetables} 0.01230300 0.5845411 3.020999 121
## [3] {root vegetables,
## rolls/buns} => {other vegetables} 0.01220132 0.5020921 2.594890 120
## [4] {root vegetables,
## yogurt} => {other vegetables} 0.01291307 0.5000000 2.584078 127
## [5] {curd,
## yogurt} => {whole milk} 0.01006609 0.5823529 2.279125 99
## [6] {other vegetables,
## butter} => {whole milk} 0.01148958 0.5736041 2.244885 113
rules.by.supp<-sort(rules, by="support", decreasing=TRUE)
inspect(head(rules.by.supp))
## lhs rhs support confidence lift count
## [1] {other vegetables,
## yogurt} => {whole milk} 0.02226741 0.5128806 2.007235 219
## [2] {tropical fruit,
## yogurt} => {whole milk} 0.01514997 0.5173611 2.024770 149
## [3] {other vegetables,
## whipped/sour cream} => {whole milk} 0.01464159 0.5070423 1.984385 144
## [4] {root vegetables,
## yogurt} => {whole milk} 0.01453991 0.5629921 2.203354 143
## [5] {pip fruit,
## other vegetables} => {whole milk} 0.01352313 0.5175097 2.025351 133
## [6] {root vegetables,
## yogurt} => {other vegetables} 0.01291307 0.5000000 2.584078 127
From the results above we see that the maximum confidence we are able to achieve is about 0.59, which means the rule holds in 59% of the transactions that contain its left-hand side. Supports are quite low (at most about 2%), but the dataset has 9835 transactions, so the most frequent rule still covers 219 of them. As for lift, the items in the best rules appear together 2-3 times more often than they would if they were independent. We see a lot of products like milk, vegetables and fruits.
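The numbers in the tables can be cross-checked with a little arithmetic. The sketch below takes the values printed for the highest-support rule, {other vegetables, yogurt} => {whole milk}, together with the transaction count reported by `length(data)`, and recovers the absolute count of the rule and the support of whole milk implied by lift = confidence / support of the right-hand side. The variable names are illustrative.

```r
n_transactions <- 9835    # reported by length(data)
supp <- 0.02226741        # support of {other vegetables, yogurt} => {whole milk}
conf <- 0.5128806         # confidence of the same rule
lift <- 2.007235          # lift of the same rule

# Support is a share of all transactions, so multiplying back gives the count
rule_count <- round(supp * n_transactions)
print(rule_count)         # 219, matching the count column

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
rhs_support <- conf / lift
print(rhs_support)        # about 0.2555: whole milk is in roughly a quarter of baskets
```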
Now that we have run the Apriori algorithm and have a better idea of how rules behave in our dataset in general, it is time to look at specific products. First we analyze rules that lead to yogurt. It turned out that we need to decrease the minimum support (to 0.001) in the Apriori algorithm; otherwise we would receive no results.
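# Keep only rules with yogurt on the right-hand side; any item may appear on the left
rules.yogurt<-apriori(data=data, parameter=list(supp=0.001, conf=0.5),
    appearance=list(default="lhs", rhs="yogurt"), control=list(verbose=FALSE))
# Sort by support so that the most frequently observed rules come first
rules.yogurt.bysupp<-sort(rules.yogurt, by="support", decreasing=TRUE)
inspect(head(rules.yogurt.bysupp))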
## lhs rhs support confidence lift count
## [1] {tropical fruit,
## curd} => {yogurt} 0.005287239 0.5148515 3.690645 52
## [2] {tropical fruit,
## whole milk,
## whipped/sour cream} => {yogurt} 0.004372140 0.5512821 3.951792 43
## [3] {tropical fruit,
## whole milk,
## curd} => {yogurt} 0.003965430 0.6093750 4.368224 39
## [4] {root vegetables,
## cream cheese } => {yogurt} 0.003762074 0.5000000 3.584184 37
## [5] {tropical fruit,
## root vegetables,
## other vegetables,
## whole milk} => {yogurt} 0.003558719 0.5072464 3.636128 35
## [6] {other vegetables,
## whole milk,
## cream cheese } => {yogurt} 0.003457041 0.5151515 3.692795 34
Yogurt is most often bought in the following combinations:
tropical fruit + curd -> yogurt
tropical fruit + whole milk + whipped/sour cream -> yogurt
tropical fruit + whole milk + curd -> yogurt
More combinations can be read from the above results.
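The rules above put yogurt on the right-hand side, i.e. they describe what leads to yogurt. A complementary view, closer to the question of what a basket that already contains yogurt tends to include, is to keep rules with yogurt on the left-hand side. The following is a sketch using the `subset` method from arules; the object names `rules.all` and `rules.from.yogurt` are introduced here for illustration and are not part of the original analysis.

```r
library(arules)
data(Groceries)

# Mine rules at the lower support used above, without restricting either side
rules.all <- apriori(Groceries,
                     parameter = list(supp = 0.001, conf = 0.5),
                     control = list(verbose = FALSE))

# Keep only rules whose antecedent (left-hand side) contains yogurt
rules.from.yogurt <- subset(rules.all, subset = lhs %in% "yogurt")

# Show the three rules with the highest lift
inspect(head(sort(rules.from.yogurt, by = "lift"), 3))
```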
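# Same approach for tropical fruit: fix it as the right-hand side of every rule
rules.fruit<-apriori(data=data, parameter=list(supp=0.001, conf=0.5),
    appearance=list(default="lhs", rhs="tropical fruit"), control=list(verbose=FALSE))
# Sort by support so that the most frequently observed rules come first
rules.fruit.bysupp<-sort(rules.fruit, by="support", decreasing=TRUE)
inspect(head(rules.fruit.bysupp))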
## lhs rhs support confidence lift count
## [1] {citrus fruit,
## root vegetables,
## other vegetables,
## whole milk} => {tropical fruit} 0.003152008 0.5438596 5.183004 31
## [2] {citrus fruit,
## other vegetables,
## whole milk,
## yogurt} => {tropical fruit} 0.002440264 0.5106383 4.866403 24
## [3] {other vegetables,
## whole milk,
## butter,
## yogurt} => {tropical fruit} 0.002338587 0.5348837 5.097463 23
## [4] {root vegetables,
## yogurt,
## bottled water} => {tropical fruit} 0.002236909 0.5789474 5.517391 22
## [5] {pip fruit,
## grapes} => {tropical fruit} 0.002135231 0.5675676 5.408941 21
## [6] {grapes,
## other vegetables,
## whole milk} => {tropical fruit} 0.002033554 0.5263158 5.015810 20
Tropical fruit is most often chosen by customers who also buy:
citrus fruit + root vegetables + other vegetables + whole milk -> tropical fruit
citrus fruit + other vegetables + whole milk + yogurt -> tropical fruit
other vegetables + whole milk + butter + yogurt -> tropical fruit
In the analytical part we calculated how yogurt and tropical fruit are related to other products, but that was only part of the given task. Tropical fruit is bought by people who buy vegetables and other fruit. So should we put these products on the same shelf for easier access, or maybe in opposite corners of the shop to make customers walk through the whole store? This is the point where data science meets substantive expertise…
Results in plain text may be informative, but they are not very attractive, so we will visualize them.
plot(rules, method="paracoord", control=list(reorder=TRUE))
The plot above shows the dependencies between products. We can see that the most red arrows go to the other vegetables category.
We can also visualize the relationship between lift and support for the rules of the chosen products (yogurt and tropical fruit). The size of a circle indicates the support of the given rule, and the intensity of its color shows the lift.
plot(rules.fruit, method="grouped")
plot(rules.yogurt, method="grouped")
Although using association analysis to maximize the number of products sold is the most obvious application, it is not the only one. There are cases in which a company wants to minimize the sales of one particular product. This happens when the margin on the product is unsatisfactory or even negative, but the company cannot increase the price or withdraw the product. It may seem like a very rare case, but in fact it is not. An example is a pharmaceutical company that sells many groups of products. Unfortunately for the company, its margin on oncology products is negative, because the competition has lower production costs or for some other reason. It cannot simply withdraw the products, because they are necessary to sustain patients’ lives; withdrawing them would be morally wrong and would reduce trust in the company.
What can be done is to use association rules to find the products with which oncology products sell best and cut any form of bundling between them, reducing sales without drastic action.
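As a rough sketch of that idea, the code below fixes a target product on the right-hand side and counts how often each other item appears in the antecedents of the resulting rules; frequently appearing items are candidates for un-bundling. No pharmaceutical data is available here, so yogurt from Groceries serves as a stand-in for the low-margin product. That choice, the thresholds, and the names `target`, `rules.target` and `lhs.counts` are all assumptions made for illustration.

```r
library(arules)
data(Groceries)

target <- "yogurt"  # stand-in for the low-margin product (assumption)

# Rules that lead to the target product
rules.target <- apriori(Groceries,
                        parameter = list(supp = 0.001, conf = 0.5),
                        appearance = list(default = "lhs", rhs = target),
                        control = list(verbose = FALSE))

# Number of rules in which each item appears on the left-hand side
lhs.counts <- itemFrequency(lhs(rules.target), type = "absolute")

# Items that most often drive sales of the target product
print(head(sort(lhs.counts, decreasing = TRUE), 5))
```

Counting antecedent items is only one possible ranking; sorting the rules by lift or confidence would highlight the strongest rather than the most common associations.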