Association rules

Association rule mining is a method for exploring relations between variables in large data sets. It is often used in market basket analysis to check whether there are general patterns in customer behaviour, e.g. whether customers who buy product X also buy product Y. Association rule methods estimate the probabilities of such events and provide data that can be used in demand and supply analysis, product placement in shops, setting discounts etc.

Dataset exploration

The dataset used in this article was downloaded from the Kaggle platform: https://www.kaggle.com/gorkhachatryan01/purchase-behaviour

data <- read.csv("products.csv", header = FALSE)
head(data)
##           V1 V2            V3
## 1 2000-01-01  1        yogurt
## 2 2000-01-01  1          pork
## 3 2000-01-01  1 sandwich bags
## 4 2000-01-01  1    lunch meat
## 5 2000-01-01  1  all- purpose
## 6 2000-01-01  1         flour

The data is given in the “single” format, which means that each row corresponds to one purchased product. The products of a single purchase can be collected by basket id (column V2).
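As a rough illustration of what collecting a purchase by id means, the sketch below groups a small invented purchase log into baskets with base R. The data frame toy and its values are made up; only the column layout (date, basket id, product) follows the file shown above.

```r
# Toy purchase log in "single" format: one row per purchased product.
# The column names V1..V3 mirror the unnamed columns of products.csv;
# the values are invented for illustration.
toy <- data.frame(
  V1 = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-02", "2000-01-02"),
  V2 = c(1, 1, 1, 2, 2),
  V3 = c("yogurt", "pork", "yogurt", "flour", "yogurt"),
  stringsAsFactors = FALSE
)

# Group product names by basket id; unique() drops repeated products
# so that each basket lists every product once.
baskets <- lapply(split(toy$V3, toy$V2), unique)

length(baskets)    # number of baskets
lengths(baskets)   # number of distinct products per basket
```

In this toy log, basket 1 lists yogurt twice, so it ends up with two distinct products.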

Let’s check the number of transactions and the number of purchased products:

length(unique(data$V2))
## [1] 1139

There are 1139 baskets.

length(data$V3)
## [1] 22343

There are 22343 rows, i.e. 22343 product purchase records.

length(unique(data$V3))
## [1] 38

There are 38 unique products.

In order to run association rule methods, the dataset needs to be converted to the “transaction” format. I will use the arules library.

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
trans<-read.transactions("products.csv", format = "single", sep=",", cols = c(2,3))
summary(trans)
## transactions as itemMatrix in sparse format with
##  1139 rows (elements/itemsets/transactions) and
##  38 columns (items) and a density of 0.3870662 
## 
## most frequent items:
## vegetables    poultry  ice cream    cereals lunch meat    (Other) 
##        842        480        454        451        450      14076 
## 
## element (itemset/transaction) length distribution:
## sizes
##  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
## 12 34 35 41 51 56 62 67 48 55 71 64 58 79 77 75 91 54 58 25 16  7  3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   10.00   15.00   14.71   19.00   26.00 
## 
## includes extended item information - examples:
##          labels
## 1  all- purpose
## 2 aluminum foil
## 3        bagels
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2            10
## 3           100

As already checked, there are 1139 transactions and 38 kinds of products in the dataset. The median basket size is 15 items. Vegetables are the most frequently purchased product.
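The figures in the summary can be cross-checked with a little arithmetic. The sketch below only reuses numbers printed above; the variable names are my own.

```r
n_transactions <- 1139
n_items        <- 38
density        <- 0.3870662   # as printed by summary(trans)

# Density is the share of filled cells in the transactions-by-items matrix,
# so multiplying it by the matrix size gives the number of filled cells.
filled_cells <- density * n_transactions * n_items
round(filled_cells)                       # 16753

# The same total follows from the "most frequent items" line of the summary.
sum(c(842, 480, 454, 451, 450, 14076))    # 16753

# Dividing by the number of baskets gives the mean basket size.
round(filled_cells / n_transactions, 2)   # 14.71
```

The total of 16753 is smaller than the 22343 rows of the raw file, which suggests that some baskets list the same product more than once and that such repeats count only once in the transaction format.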

Let’s visualise item frequency with itemFrequencyPlot (the arulesViz library loaded here is used for the rule plots later on).

library(arulesViz)
## Loading required package: grid
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus
itemFrequencyPlot(trans, topN=10, type="absolute", main="Items Frequency") 

The plot shows that product frequencies are fairly even, with the exception of vegetables.

head(sort(itemFrequency(trans, type="absolute"), decreasing=TRUE), n=38)
##                   vegetables                      poultry 
##                          842                          480 
##                    ice cream                      cereals 
##                          454                          451 
##                   lunch meat                      waffles 
##                          450                          449 
##                      cheeses                         soda 
##                          445                          445 
##                         eggs                 dinner rolls 
##                          444                          443 
## dishwashing liquid/detergent                       bagels 
##                          442                          439 
##                aluminum foil                       yogurt 
##                          438                          438 
##                         milk                   coffee/tea 
##                          433                          432 
##                         soap            laundry detergent 
##                          432                          431 
##                 toilet paper                        juice 
##                          431                          429 
##             individual meals                        mixes 
##                          428                          428 
##                 all- purpose                         beef 
##                          427                          427 
##              spaghetti sauce                      ketchup 
##                          425                          423 
##                        pasta                       fruits 
##                          423                          422 
##                    tortillas                      shampoo 
##                          421                          420 
##                       butter                sandwich bags 
##                          419                          419 
##                 paper towels                        sugar 
##                          413                          411 
##                         pork                        flour 
##                          405                          402 
##              sandwich loaves                    hand soap 
##                          398                          394

Item frequency ranges from 394 to 480, excluding vegetables.

Eclat algorithm

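To obtain rules from the dataset, the frequent itemsets have to be mined first, for example with the Apriori algorithm. The dataset contains 1139 rows and 38 columns, which gives 43282 cells. I will use the Eclat algorithm, an alternative to Apriori that searches depth-first over lists of transaction ids and is often faster on large datasets. Eclat returns frequent itemsets, i.e. combinations of products whose support is at least the chosen threshold.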

rules <-eclat(trans, parameter=list(supp=0.15, maxlen = 10)) 
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.15      1     10 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 170 
## 
## create itemset ... 
## set transactions ...[38 item(s), 1139 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating bit matrix ... [38 row(s), 1139 column(s)] done [0.00s].
## writing  ... [556 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

The algorithm found 556 itemsets with a support of at least 0.15 and a maximum length of 10.
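The idea behind Eclat can be sketched in a few lines of base R, assuming the usual textbook description of the algorithm: each item is stored with the ids of the baskets that contain it, and the support of a larger itemset is obtained by intersecting those id lists. The toy baskets, the function name eclat_sketch and the min_count value are made up for illustration; this is not the implementation used by arules.

```r
# Toy baskets (invented for illustration).
baskets <- list(
  "1" = c("eggs", "vegetables", "yogurt"),
  "2" = c("eggs", "vegetables"),
  "3" = c("vegetables", "yogurt"),
  "4" = c("eggs", "vegetables", "yogurt"),
  "5" = c("soda")
)
n <- length(baskets)

# Vertical layout: for every item, the ids of the baskets that contain it.
items    <- sort(unique(unlist(baskets)))
tidlists <- lapply(items, function(it) {
  names(baskets)[vapply(baskets, function(b) it %in% b, logical(1))]
})
names(tidlists) <- items

# Depth-first search: extend an itemset with items that come later in the
# ordering, intersecting id lists to obtain the id list of the extension.
eclat_sketch <- function(prefix, prefix_tids, candidates, min_count) {
  found <- list()
  for (i in seq_along(candidates)) {
    item <- candidates[i]
    tids <- tidlists[[item]]
    if (!is.null(prefix_tids)) {
      tids <- intersect(prefix_tids, tids)
    }
    if (length(tids) >= min_count) {
      itemset <- c(prefix, item)
      # Store the support of the frequent itemset under a readable name.
      found[[paste(itemset, collapse = ",")]] <- length(tids) / n
      # Only frequent itemsets are extended further.
      found <- c(found, eclat_sketch(itemset, tids, candidates[-seq_len(i)], min_count))
    }
  }
  found
}

frequent <- eclat_sketch(character(0), NULL, items, min_count = 3)
unlist(frequent)
```

With a minimum count of 3, the sketch keeps {eggs}, {vegetables}, {yogurt}, {eggs, vegetables} and {vegetables, yogurt}; {eggs, yogurt} occurs in only two baskets and is never extended.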

inspect(head(sort(rules, by = "support"), 10))
##      items          support   count
## [1]  {vegetables}   0.7392450 842  
## [2]  {poultry}      0.4214223 480  
## [3]  {ice cream}    0.3985953 454  
## [4]  {cereals}      0.3959614 451  
## [5]  {lunch meat}   0.3950834 450  
## [6]  {waffles}      0.3942054 449  
## [7]  {cheeses}      0.3906936 445  
## [8]  {soda}         0.3906936 445  
## [9]  {eggs}         0.3898156 444  
## [10] {dinner rolls} 0.3889377 443

Support is the probability that a basket contains the itemset, whereas count is the number of baskets in the dataset that contain it. In line with the frequency plot, vegetables appear in most baskets (about 74%).
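The relation between the two columns is simply support = count / number of baskets. A short check with the counts printed above (the variable names are my own):

```r
n_transactions <- 1139

# Counts of the three most frequent single-item sets, as printed above.
count <- c(vegetables = 842, poultry = 480, "ice cream" = 454)

# Support is the share of baskets that contain the itemset.
support <- count / n_transactions
round(support, 7)
```

The rounded values agree with the support column of the inspect() output.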

The next step is to look for rule patterns (if A then B). Let’s look for the rules with the highest confidence.

freq_rules<-ruleInduction(rules, trans, confidence=0.8)
inspect(head(sort(freq_rules, by = "confidence", decreasing = TRUE),10))
##      lhs                                       rhs          support  
## [1]  {eggs,yogurt}                          => {vegetables} 0.1571554
## [2]  {dinner rolls,eggs}                    => {vegetables} 0.1562774
## [3]  {dishwashing liquid/detergent,eggs}    => {vegetables} 0.1536435
## [4]  {cereals,laundry detergent}            => {vegetables} 0.1510097
## [5]  {cheeses,eggs}                         => {vegetables} 0.1501317
## [6]  {eggs,poultry}                         => {vegetables} 0.1553995
## [7]  {cereals,eggs}                         => {vegetables} 0.1510097
## [8]  {aluminum foil,yogurt}                 => {vegetables} 0.1527656
## [9]  {mixes,poultry}                        => {vegetables} 0.1562774
## [10] {dishwashing liquid/detergent,poultry} => {vegetables} 0.1597893
##      confidence lift     itemset
## [1]  0.8994975  1.216779 476    
## [2]  0.8989899  1.216092 454    
## [3]  0.8974359  1.213990 428    
## [4]  0.8911917  1.205543 300    
## [5]  0.8860104  1.198534 501    
## [6]  0.8805970  1.191211 508    
## [7]  0.8643216  1.169195 507    
## [8]  0.8613861  1.165224 442    
## [9]  0.8599034  1.163218 387    
## [10] 0.8544601  1.155855 429

Confidence is the probability of A and B occurring together divided by the probability of A, i.e. the conditional probability of B given A. The results show that a customer who buys eggs and yogurt also buys vegetables with a probability of about 0.90. Because most baskets contain vegetables, every rule that passes the 0.8 confidence threshold has vegetables as its consequent.
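To make the definition concrete, here is a minimal base R sketch that computes support and confidence by hand. The baskets and the helper names toy_support and toy_confidence are invented for illustration and are not part of arules.

```r
# Toy baskets (invented for illustration).
baskets <- list(
  c("eggs", "yogurt", "vegetables"),
  c("eggs", "yogurt", "vegetables"),
  c("eggs", "yogurt"),
  c("vegetables", "soda"),
  c("eggs", "vegetables")
)

# Share of baskets that contain every item of the given itemset.
toy_support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Confidence of the rule lhs => rhs: P(lhs and rhs) / P(lhs).
toy_confidence <- function(lhs, rhs, baskets) {
  toy_support(c(lhs, rhs), baskets) / toy_support(lhs, baskets)
}

toy_support(c("eggs", "yogurt"), baskets)                    # 3 of 5 baskets
toy_confidence(c("eggs", "yogurt"), "vegetables", baskets)   # 2 of those 3 baskets
```

In the toy data, three baskets contain eggs and yogurt and two of them also contain vegetables, so the confidence of {eggs, yogurt} => {vegetables} is 2/3.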

Let’s check the rules with the highest support.

inspect(sort(freq_rules, by = "support", decreasing = TRUE))
##      lhs                                       rhs          support  
## [1]  {eggs}                                 => {vegetables} 0.3266023
## [2]  {yogurt}                               => {vegetables} 0.3195786
## [3]  {aluminum foil}                        => {vegetables} 0.3107989
## [4]  {laundry detergent}                    => {vegetables} 0.3090430
## [5]  {sugar}                                => {vegetables} 0.2976295
## [6]  {sandwich loaves}                      => {vegetables} 0.2827041
## [7]  {dinner rolls,poultry}                 => {vegetables} 0.1615452
## [8]  {dishwashing liquid/detergent,poultry} => {vegetables} 0.1597893
## [9]  {eggs,soda}                            => {vegetables} 0.1580334
## [10] {lunch meat,poultry}                   => {vegetables} 0.1580334
## [11] {eggs,yogurt}                          => {vegetables} 0.1571554
## [12] {lunch meat,waffles}                   => {vegetables} 0.1571554
## [13] {mixes,poultry}                        => {vegetables} 0.1562774
## [14] {dinner rolls,eggs}                    => {vegetables} 0.1562774
## [15] {eggs,poultry}                         => {vegetables} 0.1553995
## [16] {dishwashing liquid/detergent,eggs}    => {vegetables} 0.1536435
## [17] {aluminum foil,yogurt}                 => {vegetables} 0.1527656
## [18] {poultry,yogurt}                       => {vegetables} 0.1527656
## [19] {poultry,sugar}                        => {vegetables} 0.1518876
## [20] {cereals,laundry detergent}            => {vegetables} 0.1510097
## [21] {cereals,eggs}                         => {vegetables} 0.1510097
## [22] {cheeses,eggs}                         => {vegetables} 0.1501317
##      confidence lift     itemset
## [1]  0.8378378  1.133370 509    
## [2]  0.8310502  1.124188 478    
## [3]  0.8082192  1.093304 443    
## [4]  0.8167053  1.104783 301    
## [5]  0.8248175  1.115757  60    
## [6]  0.8090452  1.094421   5    
## [7]  0.8288288  1.121183 455    
## [8]  0.8544601  1.155855 429    
## [9]  0.8450704  1.143153 466    
## [10] 0.8490566  1.148546 494    
## [11] 0.8994975  1.216779 476    
## [12] 0.8523810  1.153043 493    
## [13] 0.8599034  1.163218 387    
## [14] 0.8989899  1.216092 454    
## [15] 0.8805970  1.191211 508    
## [16] 0.8974359  1.213990 428    
## [17] 0.8613861  1.165224 442    
## [18] 0.8446602  1.142599 477    
## [19] 0.8480392  1.147169  59    
## [20] 0.8911917  1.205543 300    
## [21] 0.8643216  1.169195 507    
## [22] 0.8860104  1.198534 501

Support of a rule is the probability that A and B occur together in a basket. The single-item rules {eggs} => {vegetables} and {yogurt} => {vegetables} have the highest support (0.33 and 0.32). Aluminum foil, laundry detergent, sugar and sandwich loaves are also antecedents of vegetables with fairly high support (0.28 to 0.31).

Let’s check the lift measure.

inspect(head(sort(freq_rules, by = "lift", decreasing = TRUE), 10))
##      lhs                                       rhs          support  
## [1]  {eggs,yogurt}                          => {vegetables} 0.1571554
## [2]  {dinner rolls,eggs}                    => {vegetables} 0.1562774
## [3]  {dishwashing liquid/detergent,eggs}    => {vegetables} 0.1536435
## [4]  {cereals,laundry detergent}            => {vegetables} 0.1510097
## [5]  {cheeses,eggs}                         => {vegetables} 0.1501317
## [6]  {eggs,poultry}                         => {vegetables} 0.1553995
## [7]  {cereals,eggs}                         => {vegetables} 0.1510097
## [8]  {aluminum foil,yogurt}                 => {vegetables} 0.1527656
## [9]  {mixes,poultry}                        => {vegetables} 0.1562774
## [10] {dishwashing liquid/detergent,poultry} => {vegetables} 0.1597893
##      confidence lift     itemset
## [1]  0.8994975  1.216779 476    
## [2]  0.8989899  1.216092 454    
## [3]  0.8974359  1.213990 428    
## [4]  0.8911917  1.205543 300    
## [5]  0.8860104  1.198534 501    
## [6]  0.8805970  1.191211 508    
## [7]  0.8643216  1.169195 507    
## [8]  0.8613861  1.165224 442    
## [9]  0.8599034  1.163218 387    
## [10] 0.8544601  1.155855 429

Lift is the probability of A and B occurring together divided by the product of the probabilities of A and of B. If lift > 1, A and B occur together more often than would be expected if they were independent. In this case, a customer who buys eggs and yogurt is about 1.22 times more likely to buy vegetables than a customer picked at random.
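The lift values can be reproduced from numbers already printed, since lift equals the confidence of the rule divided by the support of its consequent. The sketch below uses three rules from the output above; the variable names are my own.

```r
supp_vegetables <- 0.7392450   # support of {vegetables} from the Eclat output

# Confidence of three rules with vegetables as the consequent, as printed above.
conf <- c(
  "eggs,yogurt"       = 0.8994975,
  "dinner rolls,eggs" = 0.8989899,
  "eggs"              = 0.8378378
)

# lift(A => B) = P(A and B) / (P(A) * P(B)) = confidence(A => B) / P(B)
lift <- conf / supp_vegetables
round(lift, 3)
```

Because all rules found here share the same consequent, lift is just confidence scaled by a constant, so sorting by lift gives the same order as sorting by confidence.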

inspect(head(sort(freq_rules, by = "itemset", decreasing = TRUE), 10))
##      lhs                     rhs          support   confidence lift     itemset
## [1]  {eggs}               => {vegetables} 0.3266023 0.8378378  1.133370 509    
## [2]  {eggs,poultry}       => {vegetables} 0.1553995 0.8805970  1.191211 508    
## [3]  {cereals,eggs}       => {vegetables} 0.1510097 0.8643216  1.169195 507    
## [4]  {cheeses,eggs}       => {vegetables} 0.1501317 0.8860104  1.198534 501    
## [5]  {lunch meat,poultry} => {vegetables} 0.1580334 0.8490566  1.148546 494    
## [6]  {lunch meat,waffles} => {vegetables} 0.1571554 0.8523810  1.153043 493    
## [7]  {yogurt}             => {vegetables} 0.3195786 0.8310502  1.124188 478    
## [8]  {poultry,yogurt}     => {vegetables} 0.1527656 0.8446602  1.142599 477    
## [9]  {eggs,yogurt}        => {vegetables} 0.1571554 0.8994975  1.216779 476    
## [10] {eggs,soda}          => {vegetables} 0.1580334 0.8450704  1.143153 466

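The itemset column is not a basket count. For example, {eggs} => {vegetables} has a support of 0.33, which corresponds to about 372 of the 1139 baskets, not 509. The values appear to be the index of the frequent itemset, among the 556 found by Eclat, from which each rule was induced, so sorting by this column has no frequency interpretation. Judging by support, the most frequent rule is {eggs} => {vegetables}.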

Let’s plot the results using the arulesViz library.

library(arulesViz)
plot(freq_rules, method="grouped") 

plot(freq_rules, measure=c("support", "confidence"), shading="lift")

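We can observe that the higher the lift, the higher the confidence. This is expected here: all rules share the same consequent, and lift equals confidence divided by the support of the consequent.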

Association rules by item

As vegetables are by far the most frequent product, all the rules found so far have them as the consequent. Let’s check rules by item, so that more diverse patterns can be discovered.

Let’s start with poultry, as it is the second most frequently purchased product. I will use the apriori algorithm, as it allows the consequent of the rules to be fixed to a chosen product. Because the other products are much less frequent than vegetables, the confidence threshold has to be lowered.

poultry<-apriori(data=trans, parameter=list(supp=0.1,conf = 0.48), 
                      appearance=list(default="lhs", rhs="poultry"), control=list(verbose=F)) 
inspect(head(sort(poultry, by="support", decreasing=TRUE),10))
##      lhs                                          rhs       support  
## [1]  {dinner rolls}                            => {poultry} 0.1949078
## [2]  {dishwashing liquid/detergent}            => {poultry} 0.1870061
## [3]  {soap}                                    => {poultry} 0.1834943
## [4]  {mixes}                                   => {poultry} 0.1817384
## [5]  {sugar}                                   => {poultry} 0.1791045
## [6]  {dinner rolls,vegetables}                 => {poultry} 0.1615452
## [7]  {dishwashing liquid/detergent,vegetables} => {poultry} 0.1597893
## [8]  {lunch meat,vegetables}                   => {poultry} 0.1580334
## [9]  {mixes,vegetables}                        => {poultry} 0.1562774
## [10] {sugar,vegetables}                        => {poultry} 0.1518876
##      confidence lift     count
## [1]  0.5011287  1.189137 222  
## [2]  0.4819005  1.143510 213  
## [3]  0.4837963  1.148008 209  
## [4]  0.4836449  1.147649 207  
## [5]  0.4963504  1.177798 204  
## [6]  0.5242165  1.243922 184  
## [7]  0.5214900  1.237452 182  
## [8]  0.5070423  1.203169 180  
## [9]  0.5281899  1.253351 178  
## [10] 0.5103245  1.210957 173

The results show that poultry is purchased together with dinner rolls, dishwashing detergent, soap and mixes with a support of around 0.18 to 0.19 and a confidence close to 0.5.

Let’s plot the rules.

plot(poultry, method="graph")

Ice cream is the third most frequent product.

ice_cream<-apriori(data=trans, parameter=list(supp=0.1,conf = 0.45), 
                      appearance=list(default="lhs", rhs="ice cream"), control=list(verbose=F)) 
inspect(head(sort(ice_cream, by="support", decreasing=TRUE),10))
##      lhs                             rhs         support   confidence lift    
## [1]  {cheeses}                    => {ice cream} 0.1791045 0.4584270  1.150106
## [2]  {aluminum foil}              => {ice cream} 0.1764706 0.4589041  1.151303
## [3]  {paper towels}               => {ice cream} 0.1703248 0.4697337  1.178473
## [4]  {pasta}                      => {ice cream} 0.1676910 0.4515366  1.132820
## [5]  {sandwich loaves}            => {ice cream} 0.1580334 0.4522613  1.134638
## [6]  {lunch meat,vegetables}      => {ice cream} 0.1466198 0.4704225  1.180201
## [7]  {aluminum foil,vegetables}   => {ice cream} 0.1457419 0.4689266  1.176448
## [8]  {cheeses,vegetables}         => {ice cream} 0.1448639 0.4687500  1.176005
## [9]  {paper towels,vegetables}    => {ice cream} 0.1378402 0.4757576  1.193586
## [10] {sandwich loaves,vegetables} => {ice cream} 0.1343284 0.4751553  1.192075
##      count
## [1]  204  
## [2]  201  
## [3]  194  
## [4]  191  
## [5]  180  
## [6]  167  
## [7]  166  
## [8]  165  
## [9]  157  
## [10] 153

Ice cream is purchased together with cheeses, aluminum foil, paper towels and pasta with a support of around 0.17 to 0.18.

plot(ice_cream, method="graph")

Cereals:

cereal<-apriori(data=trans, parameter=list(supp=0.1,conf = 0.45), 
                      appearance=list(default="lhs", rhs="cereals"), control=list(verbose=F)) 
inspect(head(sort(cereal, by="support", decreasing=TRUE),10))
##      lhs                                          rhs       support  
## [1]  {milk}                                    => {cereals} 0.1738367
## [2]  {mixes}                                   => {cereals} 0.1738367
## [3]  {paper towels}                            => {cereals} 0.1641791
## [4]  {laundry detergent,vegetables}            => {cereals} 0.1510097
## [5]  {eggs,vegetables}                         => {cereals} 0.1510097
## [6]  {mixes,vegetables}                        => {cereals} 0.1413521
## [7]  {dinner rolls,vegetables}                 => {cereals} 0.1413521
## [8]  {lunch meat,vegetables}                   => {cereals} 0.1404741
## [9]  {dishwashing liquid/detergent,vegetables} => {cereals} 0.1387182
## [10] {spaghetti sauce,vegetables}              => {cereals} 0.1369622
##      confidence lift     count
## [1]  0.4572748  1.154847 198  
## [2]  0.4626168  1.168338 198  
## [3]  0.4527845  1.143507 187  
## [4]  0.4886364  1.234051 172  
## [5]  0.4623656  1.167704 172  
## [6]  0.4777448  1.206544 161  
## [7]  0.4586895  1.158420 161  
## [8]  0.4507042  1.138253 160  
## [9]  0.4527221  1.143349 158  
## [10] 0.4615385  1.165615 156

Cereals are mostly purchased with milk, mixes and paper towels.

plot(cereal, method="graph")

Similarity and dissimilarity

Product dissimilarity can also be measured with the Jaccard index. It is based on probability calculus and computed as (p(A∪B) - p(A∩B)) / p(A∪B), which equals 1 - p(A∩B) / p(A∪B).
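A minimal base R sketch of the Jaccard dissimilarity between two products, written as 1 - |A∩B| / |A∪B| over the baskets that contain them. The incidence vectors and the function name jaccard_dissimilarity are invented for illustration.

```r
# Toy incidence vectors: TRUE when the basket contains the product
# (values invented for illustration).
cereals <- c(TRUE,  TRUE,  FALSE, TRUE,  FALSE, FALSE)
milk    <- c(TRUE,  FALSE, FALSE, TRUE,  TRUE,  FALSE)

# Jaccard dissimilarity: 1 - (baskets with both) / (baskets with at least one).
jaccard_dissimilarity <- function(a, b) {
  1 - sum(a & b) / sum(a | b)
}

jaccard_dissimilarity(cereals, milk)      # 1 - 2/4 = 0.5
jaccard_dissimilarity(cereals, cereals)   # identical products: 0
```

The value is 0 for products that always appear together and 1 for products that never share a basket.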

Let’s check product dissimilarity. I will choose the most frequent products.

df<-trans[,itemFrequency(trans)>0.39]
J_index<-dissimilarity(df, which="items") 
round(J_index,digits=3)
##            cereals cheeses ice cream lunch meat poultry  soda vegetables
## cheeses      0.714                                                      
## ice cream    0.736   0.706                                              
## lunch meat   0.729   0.743     0.720                                    
## poultry      0.716   0.713     0.721      0.705                         
## soda         0.752   0.725     0.728      0.745   0.729                 
## vegetables   0.623   0.624     0.637      0.621   0.600 0.629           
## waffles      0.745   0.717     0.721      0.695   0.743 0.708      0.615

The most dissimilar pairs of products are cereals and soda (0.752), cereals and waffles (0.745), and lunch meat and soda (0.745).

We can plot the dissimilarity using a dendrogram.

plot(hclust(J_index, method = "ward.D2"), main = "Products dendrogram")

To check the similarity of items, let’s use the affinity measure, which for a pair of items is p(A∩B) / p(A∪B), the complement of the Jaccard dissimilarity.

a = affinity(df)
round(a, digits=3)
## An object of class "ar_similarity"
##            cereals cheeses ice cream lunch meat poultry  soda vegetables
## cereals      0.000   0.286     0.264      0.271   0.284 0.248      0.377
## cheeses      0.286   0.000     0.294      0.257   0.287 0.275      0.376
## ice cream    0.264   0.294     0.000      0.280   0.279 0.272      0.363
## lunch meat   0.271   0.257     0.280      0.000   0.295 0.255      0.379
## poultry      0.284   0.287     0.279      0.295   0.000 0.271      0.400
## soda         0.248   0.275     0.272      0.255   0.271 0.000      0.371
## vegetables   0.377   0.376     0.363      0.379   0.400 0.371      0.000
## waffles      0.255   0.283     0.279      0.305   0.257 0.292      0.385
##            waffles
## cereals      0.255
## cheeses      0.283
## ice cream    0.279
## lunch meat   0.305
## poultry      0.257
## soda         0.292
## vegetables   0.385
## waffles      0.000
## Slot "method":
## [1] "Affinity"

The lowest affinity (0.248) is between soda and cereals, which is also the pair with the highest Jaccard dissimilarity (0.752).
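The two matrices printed above can be compared directly. The sketch below copies the cereals column from both outputs (the vector names are my own) and checks that the values add up to one.

```r
# Pairs with cereals, copied from the Jaccard dissimilarity output.
jaccard_dissim <- c(cheeses = 0.714, "ice cream" = 0.736, "lunch meat" = 0.729,
                    poultry = 0.716, soda = 0.752, vegetables = 0.623,
                    waffles = 0.745)

# Pairs with cereals, copied from the affinity output.
affinity_vals <- c(cheeses = 0.286, "ice cream" = 0.264, "lunch meat" = 0.271,
                   poultry = 0.284, soda = 0.248, vegetables = 0.377,
                   waffles = 0.255)

# For these values the affinity equals one minus the Jaccard dissimilarity.
all.equal(1 - jaccard_dissim, affinity_vals)

# The least similar and the most similar partner of cereals.
names(which.min(affinity_vals))   # "soda"
names(which.max(affinity_vals))   # "vegetables"
```

The two measures therefore rank product pairs in exactly opposite order, and either one is enough to read off the closest and the most distant pairs.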

Conclusions

The above analysis shows patterns in product purchasing. The rules with the highest confidence lead to vegetables from eggs and yogurt, from dinner rolls and eggs, and from dishwashing detergent and eggs. To conclude, we can observe a strong pattern linking eggs and vegetables. Among other products, poultry is purchased along with dinner rolls, ice cream with cheeses and cereals with milk.