Association rules for a retail shop

Association rule mining is an unsupervised learning technique that discovers and describes regularities between items in transaction data. One of its applications is predicting which product a customer is likely to buy given the products already in the basket. This knowledge of customers’ behavior allows for better-targeted discounts and product placement.

Suppose that our manager asked us where to put a new yogurt and tropical fruits in the store. Our task in this article is therefore to find rules for customers’ baskets that contain yogurt or tropical fruit. With the goal stated, we can proceed with the data analysis.

For the association analysis we will use the packages ‘arules’ and ‘arulesViz’. The analysis is based on the Groceries dataset, which comes with the arules package.

Data

library(arules)
## Warning: package 'arules' was built under R version 3.5.2
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 3.5.2
## Loading required package: grid
data(Groceries)
data <-  Groceries

#size(data) 
length(data)
## [1] 9835

Rules

Rules are created using the Apriori algorithm. Three main indicators are used to assess the quality of a rule:

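Support - how often the itemset (all items of the rule together) appears in the dataset, as a share of all transactions
Confidence - how often the rule holds: the share of transactions containing the left-hand side that also contain the right-hand side; 1 means 100% correctness
Lift - confidence divided by the support of the right-hand side; Lift>1 -> products are positively correlated, Lift<1 -> products are negatively correlated, Lift=1 -> products are independent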
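
To make the three measures concrete, here is a minimal sketch in base R that computes them by hand for a single rule. The five baskets and the helper `support()` are made up for illustration; they are not taken from the Groceries data or from the arules package.

```r
# Toy data (hypothetical, not from Groceries): five baskets.
baskets <- list(
  c("yogurt", "whole milk"),
  c("yogurt", "tropical fruit", "whole milk"),
  c("tropical fruit", "rolls/buns"),
  c("whole milk", "rolls/buns"),
  c("yogurt", "tropical fruit")
)

# Hypothetical helper: share of baskets that contain every item in `items`.
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Rule under study: {yogurt} => {whole milk}
lhs <- "yogurt"
rhs <- "whole milk"

supp_rule <- support(c(lhs, rhs))      # baskets with both items: 2 of 5
conf_rule <- supp_rule / support(lhs)  # of the 3 yogurt baskets, 2 contain milk
lift_rule <- conf_rule / support(rhs)  # confidence relative to the overall milk share

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

In this toy example the lift is slightly above 1, so yogurt buyers are only marginally more likely than the average customer to buy whole milk.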

Here we have to experiment a bit to find support and confidence levels that produce some results while limiting the number of rules. For this dataset they will be 0.01 and 0.5, respectively.
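
One possible way to run this experiment is to try a small grid of thresholds and count the rules each pair yields. The grid values and the names `grid` and `n_rules` below are arbitrary choices for illustration, and `Groceries` is loaded directly so that the sketch runs on its own.

```r
library(arules)
data(Groceries)

# Candidate thresholds (arbitrary illustration values).
grid <- expand.grid(supp = c(0.05, 0.01, 0.005), conf = c(0.3, 0.5))

# Count the rules produced by each (support, confidence) pair.
grid$n_rules <- mapply(function(s, cf) {
  length(apriori(Groceries,
                 parameter = list(supp = s, conf = cf),
                 control = list(verbose = FALSE)))
}, grid$supp, grid$conf)

print(grid)
```

Lower thresholds produce more rules, so the table helps to pick a pair that leaves a manageable number of them to inspect.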

rules<-apriori(data, parameter=list(supp=0.01, conf=0.5)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
itemFrequencyPlot(data, topN=20, type="relative", main="Item Frequency") 

The first graph shows the relative frequency of each product in the transactions. The most frequent items are whole milk, other vegetables and rolls/buns.

rules.by.conf<-sort(rules, by="confidence", decreasing=TRUE) 
inspect(head(rules.by.conf))
##     lhs                   rhs                   support confidence     lift count
## [1] {citrus fruit,                                                               
##      root vegetables}  => {other vegetables} 0.01037112  0.5862069 3.029608   102
## [2] {tropical fruit,                                                             
##      root vegetables}  => {other vegetables} 0.01230300  0.5845411 3.020999   121
## [3] {curd,                                                                       
##      yogurt}           => {whole milk}       0.01006609  0.5823529 2.279125    99
## [4] {other vegetables,                                                           
##      butter}           => {whole milk}       0.01148958  0.5736041 2.244885   113
## [5] {tropical fruit,                                                             
##      root vegetables}  => {whole milk}       0.01199797  0.5700483 2.230969   118
## [6] {root vegetables,                                                            
##      yogurt}           => {whole milk}       0.01453991  0.5629921 2.203354   143
rules.by.lift<-sort(rules, by="lift", decreasing=TRUE) 
inspect(head(rules.by.lift))
##     lhs                   rhs                   support confidence     lift count
## [1] {citrus fruit,                                                               
##      root vegetables}  => {other vegetables} 0.01037112  0.5862069 3.029608   102
## [2] {tropical fruit,                                                             
##      root vegetables}  => {other vegetables} 0.01230300  0.5845411 3.020999   121
## [3] {root vegetables,                                                            
##      rolls/buns}       => {other vegetables} 0.01220132  0.5020921 2.594890   120
## [4] {root vegetables,                                                            
##      yogurt}           => {other vegetables} 0.01291307  0.5000000 2.584078   127
## [5] {curd,                                                                       
##      yogurt}           => {whole milk}       0.01006609  0.5823529 2.279125    99
## [6] {other vegetables,                                                           
##      butter}           => {whole milk}       0.01148958  0.5736041 2.244885   113
rules.by.supp<-sort(rules, by="support", decreasing=TRUE) 
inspect(head(rules.by.supp))
##     lhs                     rhs                   support confidence     lift count
## [1] {other vegetables,                                                             
##      yogurt}             => {whole milk}       0.02226741  0.5128806 2.007235   219
## [2] {tropical fruit,                                                               
##      yogurt}             => {whole milk}       0.01514997  0.5173611 2.024770   149
## [3] {other vegetables,                                                             
##      whipped/sour cream} => {whole milk}       0.01464159  0.5070423 1.984385   144
## [4] {root vegetables,                                                              
##      yogurt}             => {whole milk}       0.01453991  0.5629921 2.203354   143
## [5] {pip fruit,                                                                    
##      other vegetables}   => {whole milk}       0.01352313  0.5175097 2.025351   133
## [6] {root vegetables,                                                              
##      yogurt}             => {other vegetables} 0.01291307  0.5000000 2.584078   127

From the results above we see that the maximum confidence we are able to achieve is about 0.59, which means the best rule holds in 59% of the transactions that contain its left-hand side. Supports are quite low (at most about 2%), but the dataset has 9835 transactions, so the highest support corresponds to 219 transactions. When it comes to lift, the items in the best rules appear together 2-3 times more often than they would if they were independent. Many of the rules involve products like milk, vegetables and fruits.
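
The columns of the output are linked by simple arithmetic, which can be checked by hand. The sketch below uses the figures printed above for the rule {other vegetables, yogurt} => {whole milk}; the variable names are only illustrative.

```r
n_transactions <- 9835   # length(data), as reported earlier
supp <- 0.02226741       # support of {other vegetables, yogurt} => {whole milk}
conf <- 0.5128806        # confidence of the same rule
lift <- 2.007235         # lift of the same rule

# Support is a share of all transactions, so the absolute count is supp * n.
count <- round(supp * n_transactions)

# Lift is confidence divided by the support of the right-hand side,
# so the implied share of baskets containing whole milk is conf / lift.
rhs_support <- conf / lift

print(count)
print(round(rhs_support, 3))
```

The first value reproduces the count column, and the second shows that roughly a quarter of all baskets contain whole milk, which is why a confidence of 0.51 translates into a lift of about 2.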

Having run the Apriori algorithm and gained a general picture of how rules behave in our dataset, it is time to look at specific products. First we analyze rules with yogurt on the right-hand side. It turned out that we need to decrease the minimum support (to 0.001) in the Apriori algorithm, otherwise we would receive no results.

rules.yogurt<-apriori(data=data, parameter=list(supp=0.001,conf = 0.5), 
                      appearance=list(default="lhs", rhs="yogurt"), control=list(verbose=FALSE)) 

# sort by support so that the most frequent combinations come first
rules.yogurt.bysupp<-sort(rules.yogurt, by="support", decreasing=TRUE)

inspect(head(rules.yogurt.bysupp))
##     lhs                     rhs          support confidence     lift count
## [1] {tropical fruit,                                                      
##      curd}               => {yogurt} 0.005287239  0.5148515 3.690645    52
## [2] {tropical fruit,                                                      
##      whole milk,                                                          
##      whipped/sour cream} => {yogurt} 0.004372140  0.5512821 3.951792    43
## [3] {tropical fruit,                                                      
##      whole milk,                                                          
##      curd}               => {yogurt} 0.003965430  0.6093750 4.368224    39
## [4] {root vegetables,                                                     
##      cream cheese }      => {yogurt} 0.003762074  0.5000000 3.584184    37
## [5] {tropical fruit,                                                      
##      root vegetables,                                                     
##      other vegetables,                                                    
##      whole milk}         => {yogurt} 0.003558719  0.5072464 3.636128    35
## [6] {other vegetables,                                                    
##      whole milk,                                                          
##      cream cheese }      => {yogurt} 0.003457041  0.5151515 3.692795    34

Yogurt is most often bought in the following combinations:
tropical fruit + curd -> yogurt
tropical fruit + whole milk + whipped/sour cream -> yogurt
tropical fruit + whole milk + curd -> yogurt

More combinations can be read from the results above.

rules.fruit<-apriori(data=data, parameter=list(supp=0.001,conf = 0.5), 
                      appearance=list(default="lhs", rhs="tropical fruit"), control=list(verbose=FALSE)) 

# sort by support so that the most frequent combinations come first
rules.fruit.bysupp<-sort(rules.fruit, by="support", decreasing=TRUE)

inspect(head(rules.fruit.bysupp))
##     lhs                   rhs                  support confidence     lift count
## [1] {citrus fruit,                                                              
##      root vegetables,                                                           
##      other vegetables,                                                          
##      whole milk}       => {tropical fruit} 0.003152008  0.5438596 5.183004    31
## [2] {citrus fruit,                                                              
##      other vegetables,                                                          
##      whole milk,                                                                
##      yogurt}           => {tropical fruit} 0.002440264  0.5106383 4.866403    24
## [3] {other vegetables,                                                          
##      whole milk,                                                                
##      butter,                                                                    
##      yogurt}           => {tropical fruit} 0.002338587  0.5348837 5.097463    23
## [4] {root vegetables,                                                           
##      yogurt,                                                                    
##      bottled water}    => {tropical fruit} 0.002236909  0.5789474 5.517391    22
## [5] {pip fruit,                                                                 
##      grapes}           => {tropical fruit} 0.002135231  0.5675676 5.408941    21
## [6] {grapes,                                                                    
##      other vegetables,                                                          
##      whole milk}       => {tropical fruit} 0.002033554  0.5263158 5.015810    20

Tropical fruits are most often chosen by customers who also buy:
citrus fruit + root vegetables + other vegetables + whole milk -> tropical fruit
citrus fruit + other vegetables + whole milk + yogurt -> tropical fruit
other vegetables + whole milk + butter + yogurt -> tropical fruit
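
Both rule sets above put the target product on the right-hand side. A complementary view, sketched here as one possible approach rather than as part of the original analysis, is to keep the rules in which yogurt or tropical fruit appears on the left-hand side, which shows what else such baskets tend to contain. The object names `rules_all` and `rules_target` are illustrative.

```r
library(arules)
data(Groceries)

# Same thresholds as in the first apriori run of the article.
rules_all <- apriori(Groceries,
                     parameter = list(supp = 0.01, conf = 0.5),
                     control = list(verbose = FALSE))

# Keep rules whose left-hand side contains yogurt or tropical fruit.
# For item sets, %in% matches rules containing ANY of the listed items.
rules_target <- subset(rules_all,
                       subset = lhs %in% c("yogurt", "tropical fruit"))

inspect(sort(rules_target, by = "lift"))
```

Because this filters an existing rule set instead of mining again, the support threshold of 0.01 still applies, so only fairly common combinations appear.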

In the analytical part we calculated how yogurt and tropical fruits are related to other products, but that was only part of the given task. Tropical fruits are bought by people who buy vegetables and other fruits. So should we put these products on the same shelf for easier access, or in opposite corners of the shop to make customers walk through the whole store? This is the point where data science meets substantive expertise…

Visualization

Results in plain text may be informative but are not very attractive, so we will visualize them.

plot(rules, method="paracoord", control=list(reorder=TRUE))

The plot above shows the dependencies between products. We can see that most of the red arrows lead to the other vegetables category.

We can also visualize the relationship between lift and support for the rules found for the chosen products (yogurt and tropical fruit). The size of a circle indicates the support of the given rule and the intensity of its color shows the lift.

plot(rules.fruit, method="grouped") 

plot(rules.yogurt, method="grouped") 

Other applications of association rules

Although using association analysis to maximize the number of products sold is the most obvious application, it is not the only one. There are cases in which a company wants to minimize sales of one particular product. This happens when the margin on the product is unsatisfactory or even negative, but the company cannot increase the price or withdraw the product. It may seem like a very rare case, but in fact it is not. An example is a pharmaceutical company that sells many groups of products. Suppose its margin on oncology products is negative because the competition has lower production costs, or for some other reason. The company cannot simply withdraw the product: patients depend on it to stay alive, withdrawing it would be morally wrong, and it would erode trust in the company.
What can be done is to use association rules to find the products with which oncology products sell best and to stop any form of bundling between them, reducing sales without drastic action.