Association rule mining is a method for exploring relations between variables in large datasets. It is often used for market basket analysis in retail, to check whether there are general patterns in customer behaviour, e.g. if customers buy product X, they also buy product Y. Association rule methods compute the probabilities of such events and provide data that can be used in demand/supply analysis, product placement in shopping malls, discount planning, etc.
The dataset used in this article was downloaded from the Kaggle platform: https://www.kaggle.com/gorkhachatryan01/purchase-behaviour
data <- read.csv("products.csv", header = FALSE)
head(data)
## V1 V2 V3
## 1 2000-01-01 1 yogurt
## 2 2000-01-01 1 pork
## 3 2000-01-01 1 sandwich bags
## 4 2000-01-01 1 lunch meat
## 5 2000-01-01 1 all- purpose
## 6 2000-01-01 1 flour
The data comes in a “single” format, which means that each row corresponds to one purchased product; column V2 holds the transaction id, so purchases can be grouped into baskets by that id.
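To see what a single basket looks like before any conversion, we can group the product column by the id column; a quick base-R sketch:
# Group items into baskets by the transaction id in V2.
baskets <- split(data$V3, data$V2)
baskets[["1"]]  # items purchased in transaction 1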
Let’s check the number of transactions and of individual purchased products:
length(unique(data$V2))
## [1] 1139
There are 1139 baskets.
length(data$V3)
## [1] 22343
There are 22343 products sold.
length(unique(data$V3))
## [1] 38
There are 38 unique products.
In order to run the association rules method, the dataset needs to be converted into “transactions” format. I will use the arules library.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
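As a side note, the data frame already in memory could be coerced straight into a transactions object; a minimal sketch (unique() drops any item repeated within a basket, and trans_alt is just a new name):
trans_alt <- as(lapply(split(as.character(data$V3), data$V2), unique), "transactions")
Here, however, I read the transactions directly from the CSV file: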
trans <- read.transactions("products.csv", format = "single", sep = ",", cols = c(2, 3))
summary(trans)
## transactions as itemMatrix in sparse format with
## 1139 rows (elements/itemsets/transactions) and
## 38 columns (items) and a density of 0.3870662
##
## most frequent items:
## vegetables poultry ice cream cereals lunch meat (Other)
## 842 480 454 451 450 14076
##
## element (itemset/transaction) length distribution:
## sizes
## 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 12 34 35 41 51 56 62 67 48 55 71 64 58 79 77 75 91 54 58 25 16 7 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 10.00 15.00 14.71 19.00 26.00
##
## includes extended item information - examples:
## labels
## 1 all- purpose
## 2 aluminum foil
## 3 bagels
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 10
## 3 100
As I already checked, there are 1139 transactions and 38 kinds of products in the dataset. The median length of a single basket is 15 items. Vegetables are the most frequently purchased product.
Let’s visualise item frequency using the arulesViz library.
library(arulesViz)
## Loading required package: grid
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
itemFrequencyPlot(trans, topN=10, type="absolute", main="Items Frequency")
The plot shows that product frequencies are quite even, with the exception of vegetables.
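For reference, the same plot can show shares of transactions instead of raw counts; only the type argument changes:
itemFrequencyPlot(trans, topN = 10, type = "relative", main = "Items Frequency (relative)")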
head(sort(itemFrequency(trans, type="absolute"), decreasing=TRUE), n=38)
## vegetables poultry
## 842 480
## ice cream cereals
## 454 451
## lunch meat waffles
## 450 449
## cheeses soda
## 445 445
## eggs dinner rolls
## 444 443
## dishwashing liquid/detergent bagels
## 442 439
## aluminum foil yogurt
## 438 438
## milk coffee/tea
## 433 432
## soap laundry detergent
## 432 431
## toilet paper juice
## 431 429
## individual meals mixes
## 428 428
## all- purpose beef
## 427 427
## spaghetti sauce ketchup
## 425 423
## pasta fruits
## 423 422
## tortillas shampoo
## 421 420
## butter sandwich bags
## 419 419
## paper towels sugar
## 413 411
## pork flour
## 405 402
## sandwich loaves hand soap
## 398 394
Excluding vegetables, item frequency ranges from 394 to 480.
In order to obtain rules from the dataset, frequent itemsets have to be mined first. The dataset contains 1139 rows and 38 columns, which gives 43282 cells. I will use the Eclat algorithm, an alternative to Apriori that is often faster on larger datasets; it detects the most frequent itemset compositions.
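For comparison, Apriori can produce the same frequent itemsets when asked via its target parameter; a minimal sketch, not run here (freq_ap is just a new name):
# Frequent-itemset mining with Apriori instead of Eclat.
freq_ap <- apriori(trans, parameter = list(supp = 0.15, maxlen = 10, target = "frequent itemsets"))
Here I use Eclat: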
rules <- eclat(trans, parameter = list(supp = 0.15, maxlen = 10))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.15 1 10 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 170
##
## create itemset ...
## set transactions ...[38 item(s), 1139 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating bit matrix ... [38 row(s), 1139 column(s)] done [0.00s].
## writing ... [556 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
The algorithm found 556 frequent itemsets at a minimum support of 0.15, with the maximum itemset length capped at 10.
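A quick way to see how those 556 itemsets break down by length (size() comes with arules):
table(size(rules))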
inspect(head(sort(rules, by = "support"), 10))
## items support count
## [1] {vegetables} 0.7392450 842
## [2] {poultry} 0.4214223 480
## [3] {ice cream} 0.3985953 454
## [4] {cereals} 0.3959614 451
## [5] {lunch meat} 0.3950834 450
## [6] {waffles} 0.3942054 449
## [7] {cheeses} 0.3906936 445
## [8] {soda} 0.3906936 445
## [9] {eggs} 0.3898156 444
## [10] {dinner rolls} 0.3889377 443
Support is the share of transactions in which an itemset occurs, whereas count is the number of such transactions in the dataset. Consistent with the frequency plot, vegetables appear in most baskets.
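As a quick hand check, support is simply the relative frequency of an itemset:
# supp({vegetables}) = count / number of transactions
842 / 1139  # = 0.7392450, the value reported above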
The next step is to look for rule patterns (if A, then B). Let’s induce rules and inspect the most probable ones, i.e. those with the highest confidence.
freq_rules <- ruleInduction(rules, trans, confidence = 0.8)
inspect(head(sort(freq_rules, by = "confidence", decreasing = TRUE),10))
## lhs rhs support
## [1] {eggs,yogurt} => {vegetables} 0.1571554
## [2] {dinner rolls,eggs} => {vegetables} 0.1562774
## [3] {dishwashing liquid/detergent,eggs} => {vegetables} 0.1536435
## [4] {cereals,laundry detergent} => {vegetables} 0.1510097
## [5] {cheeses,eggs} => {vegetables} 0.1501317
## [6] {eggs,poultry} => {vegetables} 0.1553995
## [7] {cereals,eggs} => {vegetables} 0.1510097
## [8] {aluminum foil,yogurt} => {vegetables} 0.1527656
## [9] {mixes,poultry} => {vegetables} 0.1562774
## [10] {dishwashing liquid/detergent,poultry} => {vegetables} 0.1597893
## confidence lift itemset
## [1] 0.8994975 1.216779 476
## [2] 0.8989899 1.216092 454
## [3] 0.8974359 1.213990 428
## [4] 0.8911917 1.205543 300
## [5] 0.8860104 1.198534 501
## [6] 0.8805970 1.191211 508
## [7] 0.8643216 1.169195 507
## [8] 0.8613861 1.165224 442
## [9] 0.8599034 1.163218 387
## [10] 0.8544601 1.155855 429
Confidence indicates the probability of A and B occurring together divided by the probability of A, i.e. the conditional probability of B given A. The results show that if a customer buys eggs and yogurt, they will also buy vegetables with a probability of about 0.9. Because most baskets contain vegetables, most rules have vegetables as the consequent.
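These figures can be unpacked by hand: since confidence(A => B) = supp(A and B) / supp(A), the counts behind the first rule follow from its reported support and confidence:
round(0.1571554 * 1139)              # 179 baskets with eggs, yogurt and vegetables
round(0.1571554 / 0.8994975 * 1139)  # 199 baskets with eggs and yogurt
179 / 199                            # = 0.8994975, the reported confidence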
Let’s check the rules with the highest support.
inspect(sort(freq_rules, by = "support", decreasing = TRUE))
## lhs rhs support
## [1] {eggs} => {vegetables} 0.3266023
## [2] {yogurt} => {vegetables} 0.3195786
## [3] {aluminum foil} => {vegetables} 0.3107989
## [4] {laundry detergent} => {vegetables} 0.3090430
## [5] {sugar} => {vegetables} 0.2976295
## [6] {sandwich loaves} => {vegetables} 0.2827041
## [7] {dinner rolls,poultry} => {vegetables} 0.1615452
## [8] {dishwashing liquid/detergent,poultry} => {vegetables} 0.1597893
## [9] {eggs,soda} => {vegetables} 0.1580334
## [10] {lunch meat,poultry} => {vegetables} 0.1580334
## [11] {eggs,yogurt} => {vegetables} 0.1571554
## [12] {lunch meat,waffles} => {vegetables} 0.1571554
## [13] {mixes,poultry} => {vegetables} 0.1562774
## [14] {dinner rolls,eggs} => {vegetables} 0.1562774
## [15] {eggs,poultry} => {vegetables} 0.1553995
## [16] {dishwashing liquid/detergent,eggs} => {vegetables} 0.1536435
## [17] {aluminum foil,yogurt} => {vegetables} 0.1527656
## [18] {poultry,yogurt} => {vegetables} 0.1527656
## [19] {poultry,sugar} => {vegetables} 0.1518876
## [20] {cereals,laundry detergent} => {vegetables} 0.1510097
## [21] {cereals,eggs} => {vegetables} 0.1510097
## [22] {cheeses,eggs} => {vegetables} 0.1501317
## confidence lift itemset
## [1] 0.8378378 1.133370 509
## [2] 0.8310502 1.124188 478
## [3] 0.8082192 1.093304 443
## [4] 0.8167053 1.104783 301
## [5] 0.8248175 1.115757 60
## [6] 0.8090452 1.094421 5
## [7] 0.8288288 1.121183 455
## [8] 0.8544601 1.155855 429
## [9] 0.8450704 1.143153 466
## [10] 0.8490566 1.148546 494
## [11] 0.8994975 1.216779 476
## [12] 0.8523810 1.153043 493
## [13] 0.8599034 1.163218 387
## [14] 0.8989899 1.216092 454
## [15] 0.8805970 1.191211 508
## [16] 0.8974359 1.213990 428
## [17] 0.8613861 1.165224 442
## [18] 0.8446602 1.142599 477
## [19] 0.8480392 1.147169 59
## [20] 0.8911917 1.205543 300
## [21] 0.8643216 1.169195 507
## [22] 0.8860104 1.198534 501
The support measure indicates the probability that events A and B occur together. Eggs and yogurt, each on its own, lead to purchasing vegetables with the highest support. Aluminum foil, laundry detergent, sugar and sandwich loaves are also antecedents of purchasing vegetables with quite high support.
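As a side note, the single-antecedent rules that dominate the top of this list can be pulled out by filtering on the size of the left-hand side (size() and lhs() come with arules):
# Keep only rules whose antecedent consists of a single item.
inspect(freq_rules[size(lhs(freq_rules)) == 1])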
Let’s check the lift measure.
inspect(head(sort(freq_rules, by = "lift", decreasing = TRUE), 10))
## lhs rhs support
## [1] {eggs,yogurt} => {vegetables} 0.1571554
## [2] {dinner rolls,eggs} => {vegetables} 0.1562774
## [3] {dishwashing liquid/detergent,eggs} => {vegetables} 0.1536435
## [4] {cereals,laundry detergent} => {vegetables} 0.1510097
## [5] {cheeses,eggs} => {vegetables} 0.1501317
## [6] {eggs,poultry} => {vegetables} 0.1553995
## [7] {cereals,eggs} => {vegetables} 0.1510097
## [8] {aluminum foil,yogurt} => {vegetables} 0.1527656
## [9] {mixes,poultry} => {vegetables} 0.1562774
## [10] {dishwashing liquid/detergent,poultry} => {vegetables} 0.1597893
## confidence lift itemset
## [1] 0.8994975 1.216779 476
## [2] 0.8989899 1.216092 454
## [3] 0.8974359 1.213990 428
## [4] 0.8911917 1.205543 300
## [5] 0.8860104 1.198534 501
## [6] 0.8805970 1.191211 508
## [7] 0.8643216 1.169195 507
## [8] 0.8613861 1.165224 442
## [9] 0.8599034 1.163218 387
## [10] 0.8544601 1.155855 429
Lift is the probability of A and B occurring together divided by the product of the probabilities of A and B separately. If lift > 1, events A and B occur together more often than they would if they were independent. In this case, a customer who buys eggs and yogurt is about 22% more likely to buy vegetables than an average customer (lift ≈ 1.22).
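This can be verified by hand, since lift(A => B) = confidence(A => B) / supp(B) and supp({vegetables}) = 0.7392450:
0.8994975 / 0.7392450  # = 1.216779, matching the first rule in the table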
inspect(head(sort(freq_rules, by = "itemset", decreasing = TRUE), 10))
## lhs rhs support confidence lift itemset
## [1] {eggs} => {vegetables} 0.3266023 0.8378378 1.133370 509
## [2] {eggs,poultry} => {vegetables} 0.1553995 0.8805970 1.191211 508
## [3] {cereals,eggs} => {vegetables} 0.1510097 0.8643216 1.169195 507
## [4] {cheeses,eggs} => {vegetables} 0.1501317 0.8860104 1.198534 501
## [5] {lunch meat,poultry} => {vegetables} 0.1580334 0.8490566 1.148546 494
## [6] {lunch meat,waffles} => {vegetables} 0.1571554 0.8523810 1.153043 493
## [7] {yogurt} => {vegetables} 0.3195786 0.8310502 1.124188 478
## [8] {poultry,yogurt} => {vegetables} 0.1527656 0.8446602 1.142599 477
## [9] {eggs,yogurt} => {vegetables} 0.1571554 0.8994975 1.216779 476
## [10] {eggs,soda} => {vegetables} 0.1580334 0.8450704 1.143153 466
The itemset column is not a count but an identifier: it references the frequent itemset from which each rule was induced. To find the most frequent combinations we should sort by support, as above; by that measure the pairing of eggs with vegetables appears in the most baskets.
Let’s plot the results using the arulesViz library.
library(arulesViz)
plot(freq_rules, method="grouped")
plot(freq_rules, measure = c("support", "confidence"), shading = "lift")
We can observe that the higher the lift, the higher the confidence. This is expected here: since almost every rule has vegetables as the consequent, lift is simply confidence divided by the constant support of vegetables.
As vegetables are the most frequent product in this basket analysis, there are virtually no rules that do not contain them. Let’s check rules by item instead, so that more diverse patterns can be discovered.
Let’s start with poultry, as it is the second most frequently purchased product. I will use the apriori algorithm, as its appearance argument allows searching for rules with a given product as the consequent. In order to get any such rules, the confidence threshold has to be lowered.
poultry <- apriori(data = trans, parameter = list(supp = 0.1, conf = 0.48),
                   appearance = list(default = "lhs", rhs = "poultry"), control = list(verbose = FALSE))
inspect(head(sort(poultry, by="support", decreasing=TRUE),10))
## lhs rhs support
## [1] {dinner rolls} => {poultry} 0.1949078
## [2] {dishwashing liquid/detergent} => {poultry} 0.1870061
## [3] {soap} => {poultry} 0.1834943
## [4] {mixes} => {poultry} 0.1817384
## [5] {sugar} => {poultry} 0.1791045
## [6] {dinner rolls,vegetables} => {poultry} 0.1615452
## [7] {dishwashing liquid/detergent,vegetables} => {poultry} 0.1597893
## [8] {lunch meat,vegetables} => {poultry} 0.1580334
## [9] {mixes,vegetables} => {poultry} 0.1562774
## [10] {sugar,vegetables} => {poultry} 0.1518876
## confidence lift count
## [1] 0.5011287 1.189137 222
## [2] 0.4819005 1.143510 213
## [3] 0.4837963 1.148008 209
## [4] 0.4836449 1.147649 207
## [5] 0.4963504 1.177798 204
## [6] 0.5242165 1.243922 184
## [7] 0.5214900 1.237452 182
## [8] 0.5070423 1.203169 180
## [9] 0.5281899 1.253351 178
## [10] 0.5103245 1.210957 173
The results show that poultry is purchased together with dinner rolls, dishwashing liquid, soap and mixes; each of these pairs appears in roughly 18-19% of baskets, with a confidence of around 0.5.
Let’s plot the rules.
plot(poultry, method="graph")
Ice cream is the third most frequent product.
ice_cream <- apriori(data = trans, parameter = list(supp = 0.1, conf = 0.45),
                     appearance = list(default = "lhs", rhs = "ice cream"), control = list(verbose = FALSE))
inspect(head(sort(ice_cream, by="support", decreasing=TRUE),10))
## lhs rhs support confidence lift
## [1] {cheeses} => {ice cream} 0.1791045 0.4584270 1.150106
## [2] {aluminum foil} => {ice cream} 0.1764706 0.4589041 1.151303
## [3] {paper towels} => {ice cream} 0.1703248 0.4697337 1.178473
## [4] {pasta} => {ice cream} 0.1676910 0.4515366 1.132820
## [5] {sandwich loaves} => {ice cream} 0.1580334 0.4522613 1.134638
## [6] {lunch meat,vegetables} => {ice cream} 0.1466198 0.4704225 1.180201
## [7] {aluminum foil,vegetables} => {ice cream} 0.1457419 0.4689266 1.176448
## [8] {cheeses,vegetables} => {ice cream} 0.1448639 0.4687500 1.176005
## [9] {paper towels,vegetables} => {ice cream} 0.1378402 0.4757576 1.193586
## [10] {sandwich loaves,vegetables} => {ice cream} 0.1343284 0.4751553 1.192075
## count
## [1] 204
## [2] 201
## [3] 194
## [4] 191
## [5] 180
## [6] 167
## [7] 166
## [8] 165
## [9] 157
## [10] 153
Ice cream is purchased with cheeses, aluminum foil, paper towels and pasta, each with support of around 0.17.
plot(ice_cream, method="graph")
Cereals:
cereal <- apriori(data = trans, parameter = list(supp = 0.1, conf = 0.45),
                  appearance = list(default = "lhs", rhs = "cereals"), control = list(verbose = FALSE))
inspect(head(sort(cereal, by="support", decreasing=TRUE),10))
## lhs rhs support
## [1] {milk} => {cereals} 0.1738367
## [2] {mixes} => {cereals} 0.1738367
## [3] {paper towels} => {cereals} 0.1641791
## [4] {laundry detergent,vegetables} => {cereals} 0.1510097
## [5] {eggs,vegetables} => {cereals} 0.1510097
## [6] {mixes,vegetables} => {cereals} 0.1413521
## [7] {dinner rolls,vegetables} => {cereals} 0.1413521
## [8] {lunch meat,vegetables} => {cereals} 0.1404741
## [9] {dishwashing liquid/detergent,vegetables} => {cereals} 0.1387182
## [10] {spaghetti sauce,vegetables} => {cereals} 0.1369622
## confidence lift count
## [1] 0.4572748 1.154847 198
## [2] 0.4626168 1.168338 198
## [3] 0.4527845 1.143507 187
## [4] 0.4886364 1.234051 172
## [5] 0.4623656 1.167704 172
## [6] 0.4777448 1.206544 161
## [7] 0.4586895 1.158420 161
## [8] 0.4507042 1.138253 160
## [9] 0.4527221 1.143349 158
## [10] 0.4615385 1.165615 156
Cereals are mostly purchased with milk, mixes and paper towels.
plot(cereal, method="graph")
There is also a possibility to measure the dissimilarity of products using the Jaccard index. It is based on probability calculus and computed with the formula d(A, B) = (p(A∪B) - p(A∩B))/p(A∪B) = 1 - p(A∩B)/p(A∪B).
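The formula can be checked by hand for one pair of items; a minimal sketch, assuming the standard coercion of a transactions object to a logical item matrix:
# Jaccard dissimilarity for cereals vs soda, computed from the raw baskets.
m <- as(trans[, c("cereals", "soda")], "matrix")
both <- sum(m[, "cereals"] & m[, "soda"])    # baskets with both items
either <- sum(m[, "cereals"] | m[, "soda"])  # baskets with at least one
1 - both / either  # should match the cereals/soda entry below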
Let’s now compute the full product dissimilarity matrix. I will restrict the analysis to the most frequent products (those appearing in more than 39% of baskets).
df <- trans[, itemFrequency(trans) > 0.39]
J_index <- dissimilarity(df, which = "items")
round(J_index, digits = 3)
## cereals cheeses ice cream lunch meat poultry soda vegetables
## cheeses 0.714
## ice cream 0.736 0.706
## lunch meat 0.729 0.743 0.720
## poultry 0.716 0.713 0.721 0.705
## soda 0.752 0.725 0.728 0.745 0.729
## vegetables 0.623 0.624 0.637 0.621 0.600 0.629
## waffles 0.745 0.717 0.721 0.695 0.743 0.708 0.615
The most dissimilar pairs are cereals and soda (0.752), cereals and waffles (0.745), and lunch meat and soda (0.745).
We can visualise the dissimilarities with a dendrogram.
plot(hclust(J_index, method = "ward.D2"), main = "Products dendrogram")
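To turn the dendrogram into concrete product groups, the tree can be cut at a chosen number of clusters; k = 2 below is an arbitrary illustrative choice:
# Assign each of the frequent products to one of two clusters.
cutree(hclust(J_index, method = "ward.D2"), k = 2)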
In order to check the similarity of items, let’s use the affinity measure.
a <- affinity(df)
round(a, digits = 3)
## An object of class "ar_similarity"
## cereals cheeses ice cream lunch meat poultry soda vegetables
## cereals 0.000 0.286 0.264 0.271 0.284 0.248 0.377
## cheeses 0.286 0.000 0.294 0.257 0.287 0.275 0.376
## ice cream 0.264 0.294 0.000 0.280 0.279 0.272 0.363
## lunch meat 0.271 0.257 0.280 0.000 0.295 0.255 0.379
## poultry 0.284 0.287 0.279 0.295 0.000 0.271 0.400
## soda 0.248 0.275 0.272 0.255 0.271 0.000 0.371
## vegetables 0.377 0.376 0.363 0.379 0.400 0.371 0.000
## waffles 0.255 0.283 0.279 0.305 0.257 0.292 0.385
## waffles
## cereals 0.255
## cheeses 0.283
## ice cream 0.279
## lunch meat 0.305
## poultry 0.257
## soda 0.292
## vegetables 0.385
## waffles 0.000
## Slot "method":
## [1] "Affinity"
The pair with the lowest affinity (0.248) is again soda and cereals. Note that affinity is one minus the Jaccard dissimilarity, so the two measures agree.
The above analysis shows patterns in product purchasing. The most probable baskets contain vegetables, eggs and yogurt; dinner rolls, eggs and vegetables; or detergent, eggs and vegetables. To conclude, we can observe a strong co-purchase pattern of vegetables and eggs. Among other products, poultry is purchased along with dinner rolls, ice cream with cheeses, and cereals with milk.