Market Basket Analysis is one of the association rule methods. Such a method produces rules of the form “if” and “then”. In simple words, if a client has a product in the basket, the method indicates which other products that client is most likely to buy. E.g. if a client buys beer, then it is likely (say, with a probability of 90%) that they will also buy a bag of crisps.
Among all the orders we look for the most frequent rules (“if” and “then”). The rules are evaluated with three indicators: support, confidence and lift.
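To make the three indicators concrete, here is a minimal base R sketch on five made-up baskets. The data and the `support()` helper are assumptions for illustration only and are not part of the Instacart dataset or of any package.

```r
# Toy data: five hypothetical baskets (not taken from the Instacart dataset)
baskets <- list(
  c("beer", "crisps"),
  c("beer", "crisps", "milk"),
  c("beer", "milk"),
  c("crisps", "milk"),
  c("beer", "crisps")
)

# Hypothetical helper: share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

supp_lhs  <- support("beer", baskets)              # how often the "if" part occurs
supp_rhs  <- support("crisps", baskets)            # how often the "then" part occurs
supp_both <- support(c("beer", "crisps"), baskets) # support of the rule beer => crisps

confidence <- supp_both / supp_lhs   # share of beer baskets that also contain crisps
lift       <- confidence / supp_rhs  # confidence relative to the baseline share of crisps

print(c(support = supp_both, confidence = confidence, lift = lift))
```

A lift above 1 means the two products appear together more often than they would if they were bought independently. In this toy data the lift is slightly below 1, so beer does not make crisps more likely here.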
This project is based on the Instacart dataset from the Kaggle competition https://www.kaggle.com/c/instacart-market-basket-analysis/data. It consists of 131209 orders and 39123 different products.
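library(arules)     # read.transactions(), eclat(), apriori()
library(arulesViz)  # plot methods for rules used further down
trans<-read.transactions("trans.csv", format = "single", sep=",",cols = c("order_id","product_name"))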
summary(trans)
## transactions as itemMatrix in sparse format with
## 131209 rows (elements/itemsets/transactions) and
## 39123 columns (items) and a density of 0.0002697329
##
## most frequent items:
## Banana Bag of Organic Bananas Organic Strawberries
## 18726 15480 10894
## Organic Baby Spinach Large Lemon (Other)
## 9784 8135 1321598
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 6845 7368 8033 8218 8895 8708 8541 7983 7217 6553 6034 5383 4843 4394 3831
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 3522 3108 2719 2473 2102 1857 1681 1462 1292 1079 986 860 679 634 553
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 446 403 346 315 280 210 193 178 142 99 90 88 75 79 64
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 48 49 32 26 31 24 23 18 15 12 10 6 5 4 8
## 61 62 63 64 65 66 67 68 70 72 74 75 76 77 80
## 3 3 5 4 3 2 1 2 4 2 2 1 2 1 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 9.00 10.55 14.00 80.00
##
## includes extended item information - examples:
## labels
## 1 #2 Coffee Filters
## 2 #2 Cone White Coffee Filters
## 3 #2 Mechanical Pencils
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 100000
## 3 1000008
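From the summary of the dataset, basic information can be obtained. Firstly, the most frequent items are ‘Banana’ (18726 orders), ‘Bag of Organic Bananas’ (15480), ‘Organic Strawberries’ (10894), ‘Organic Baby Spinach’ (9784) and ‘Large Lemon’ (8135).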
What is more, the distribution of basket sizes can be analyzed, and a plot of that distribution is useful here.
The most frequent basket size is 5, the median is 9 and the mean size is 10.55.
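As a quick consistency check, the mean basket size can be recovered from the density reported in the summary output, since density is the share of filled cells in the orders-by-products matrix. The sketch below only reuses numbers printed in that output; the variable names are made up for illustration.

```r
# Figures copied from the summary(trans) output
n_items <- 39123         # columns (distinct products)
density <- 0.0002697329  # share of non-zero cells in the sparse matrix

# Each order is a row with n_items cells, so the average number of
# filled cells per row is density * n_items
mean_basket_size <- density * n_items
print(round(mean_basket_size, 2))
```

The result agrees with the mean of the basket length distribution shown in the summary.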
library(dplyr)
library(ggplot2)
# df is assumed to be the long-format table behind trans.csv (one row per order_id / product_name pair)
group_basket <- df %>% group_by(order_id) %>% summarise(basket_size = n())
basket_sizes <- group_basket %>% group_by(basket_size) %>% summarise(count = n())
ggplot(basket_sizes, aes(x = basket_size, y = count)) + geom_bar(stat = "identity") + scale_x_continuous(breaks = seq(0, 80, by = 5))
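The same two aggregation steps can be sketched in base R on a tiny made-up table, which makes it easier to see what each step produces. The `orders` data frame below is hypothetical and only mimics the long format (one row per order and product).

```r
# Hypothetical long-format data: one row per (order_id, product_name) pair
orders <- data.frame(
  order_id     = c(1, 1, 2, 3, 3, 3, 4, 4),
  product_name = c("Banana", "Limes", "Banana", "Banana",
                   "Limes", "Soda", "Soda", "Limes"),
  stringsAsFactors = FALSE
)

# Step 1: rows per order = basket size
basket_size <- as.vector(table(orders$order_id))

# Step 2: number of baskets for each basket size
size_counts <- table(basket_size)
print(size_counts)

# Most frequent basket size and mean basket size
modal_size <- as.integer(names(size_counts)[which.max(size_counts)])
mean_size  <- mean(basket_size)
print(c(modal_size = modal_size, mean_size = mean_size))
```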
As noted above, ‘Banana’ is the product that customers bought most often. The topN most frequent items can be plotted.
itemFrequencyPlot(trans,topN=20,type="absolute")
We know which products are the most popular, but it is harder to get an overview of the products that were rarely bought.
item_freq <- as.data.frame(itemFrequency(trans, type="absolute"))
colnames(item_freq) <- 'number_of_purchases'
item_freq %>% group_by(.,number_of_purchases) %>% summarise(number_of_products = n()) %>% head(.,5)
## # A tibble: 5 x 2
## number_of_purchases number_of_products
## <int> <int>
## 1 1 7884
## 2 2 4910
## 3 3 3291
## 4 4 2441
## 5 5 1815
The table shows that there are 7884 products that were bought only once, 4910 products that were bought twice and so on.
Rules are filtered by their support and confidence, so minimum levels for both statistics have to be defined. This is what restricts the analysis to the most frequent rules/patterns.
Firstly, the most frequent item sets are found with the Eclat algorithm. The default support is 0.1, but in this dataset a lower value is required to obtain any results.
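The idea behind Eclat is to store, for every item, the ids of the baskets that contain it, and to obtain the support of a larger item set by intersecting those id lists. The following is a minimal base R sketch of that idea on made-up baskets. The function `eclat_sketch()` and the toy data are hypothetical, and the arules package uses its own optimised implementation rather than this code.

```r
# Toy baskets (hypothetical, not from the Instacart data)
baskets <- list(
  c("beer", "crisps"),
  c("beer", "crisps", "milk"),
  c("beer", "milk"),
  c("crisps", "milk"),
  c("beer", "crisps")
)

eclat_sketch <- function(baskets, min_count) {
  items <- sort(unique(unlist(baskets)))
  # Vertical layout: for each item, the ids of the baskets that contain it
  tidlists <- lapply(items, function(it) {
    which(vapply(baskets, function(b) it %in% b, logical(1)))
  })
  names(tidlists) <- items
  tidlists <- tidlists[lengths(tidlists) >= min_count]

  found <- list()
  grow <- function(prefix, candidates) {
    for (i in seq_along(candidates)) {
      itemset <- c(prefix, names(candidates)[i])
      found[[length(found) + 1]] <<- list(items = itemset,
                                          count = length(candidates[[i]]))
      if (i < length(candidates)) {
        # Count of a larger item set = size of the intersected id lists
        ext <- lapply(candidates[(i + 1):length(candidates)],
                      function(tids) intersect(candidates[[i]], tids))
        ext <- ext[lengths(ext) >= min_count]  # prune infrequent extensions
        if (length(ext) > 0) grow(itemset, ext)
      }
    }
  }
  grow(character(0), tidlists)
  found
}

res <- eclat_sketch(baskets, min_count = 2)
for (r in res) {
  cat("{", paste(r$items, collapse = ", "), "} count =", r$count, "\n")
}
```

With `min_count = 2` the three-item set is pruned because only one toy basket contains all three products, which mirrors how a high support threshold leaves only single items in the real data.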
freq_items<-eclat(trans, parameter=list(supp=0.03, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.03 1 15 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 3936
##
## create itemset ...
## set transactions ...[39123 item(s), 131209 transaction(s)] done [1.23s].
## sorting and recoding items ... [17 item(s)] done [0.01s].
## creating sparse bit matrix ... [17 row(s), 131209 column(s)] done [0.02s].
## writing ... [17 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
inspect(freq_items)
## items support count
## [1] {Banana} 0.14271887 18726
## [2] {Bag of Organic Bananas} 0.11797971 15480
## [3] {Organic Strawberries} 0.08302784 10894
## [4] {Organic Baby Spinach} 0.07456806 9784
## [5] {Large Lemon} 0.06200032 8135
## [6] {Organic Hass Avocado} 0.05558308 7293
## [7] {Organic Avocado} 0.05646716 7409
## [8] {Limes} 0.04598008 6033
## [9] {Organic Raspberries} 0.04226844 5546
## [10] {Strawberries} 0.04949356 6494
## [11] {Organic Cucumber} 0.03515765 4613
## [12] {Organic Zucchini} 0.03497473 4589
## [13] {Organic Blueberries} 0.03784801 4966
## [14] {Organic Yellow Onion} 0.03269593 4290
## [15] {Organic Whole Milk} 0.03740597 4908
## [16] {Organic Garlic} 0.03168990 4158
## [17] {Seedless Red Grapes} 0.03093538 4059
The most frequent item sets all consist of a single item. With a minimum support of 0.03, no set of two or more different items is frequent enough to be returned.
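It helps to translate the relative threshold into a number of orders. The sketch below reuses figures printed elsewhere in this analysis (the number of transactions, and the support of the best-supported rule found further down, {Organic Hass Avocado} => {Bag of Organic Bananas}); the variable names are made up for illustration.

```r
n_transactions <- 131209  # from summary(trans)
min_support    <- 0.03    # threshold passed to eclat()

# Threshold expressed as a number of baskets; the eclat log reports 3936
min_count <- min_support * n_transactions
print(min_count)

# Best-supported rule reported later:
# {Organic Hass Avocado} => {Bag of Organic Bananas}, support 0.018443857
pair_count <- 0.018443857 * n_transactions
print(round(pair_count))

# FALSE: this pair appears in far fewer baskets than the threshold requires
print(pair_count >= min_count)
```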
The next step is to find the most frequent rules. A rule needs an item set of at least two items, so the support threshold has to be lowered.
freq_items<-eclat(trans, parameter=list(supp=0.001, maxlen=15))
freq_rules<-ruleInduction(freq_items, trans, confidence=0.3)
summary(freq_rules)
## set of 347 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 65 267 15
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 2.856 3.000 4.000
##
## summary of quality measures:
## support confidence lift itemset
## Min. :0.001006 Min. :0.3000 Min. : 2.103 Min. : 1
## 1st Qu.:0.001158 1st Qu.:0.3212 1st Qu.: 2.682 1st Qu.:1152
## Median :0.001379 Median :0.3523 Median : 3.415 Median :1993
## Mean :0.001850 Mean :0.3675 Mean : 5.734 Mean :1677
## 3rd Qu.:0.001741 3rd Qu.:0.4007 3rd Qu.: 4.355 3rd Qu.:2332
## Max. :0.018444 Max. :0.5984 Max. :80.298 Max. :2574
##
## mining info:
## data ntransactions support confidence
## trans 131209 0.001 0.3
There are 347 rules, of which 65 are of size 2 (lhs is one product and rhs is one product), 267 are of size 3 (lhs is two items) and 15 are of size 4 (lhs is three items). The mean support is 0.00185, the mean confidence is 0.37 and the average lift is 5.73.
The rules with the highest lift value will be evaluated.
inspect(head(sort(freq_rules, by ="lift"),10))
## lhs rhs support confidence lift itemset
## [1] {Strawberry Rhubarb Yoghurt} => {Blueberry Yoghurt} 0.001196564 0.3096647 80.29801 37
## [2] {Blueberry Yoghurt} => {Strawberry Rhubarb Yoghurt} 0.001196564 0.3102767 80.29801 37
## [3] {Nonfat Icelandic Style Strawberry Yogurt} => {Icelandic Style Skyr Blueberry Non-fat Yogurt} 0.001166079 0.4226519 78.66062 12
## [4] {Non Fat Acai & Mixed Berries Yogurt} => {Icelandic Style Skyr Blueberry Non-fat Yogurt} 0.001288021 0.4023810 74.88795 17
## [5] {Icelandic Style Skyr Blueberry Non-fat Yogurt} => {Non Fat Raspberry Yogurt} 0.001676714 0.3120567 71.08447 67
## [6] {Non Fat Raspberry Yogurt} => {Icelandic Style Skyr Blueberry Non-fat Yogurt} 0.001676714 0.3819444 71.08447 67
## [7] {Lemon Sparkling Water} => {Grapefruit Sparkling Water} 0.001097486 0.3130435 65.19702 10
## [8] {Total 2% Lowfat Greek Strained Yogurt With Blueberry} => {Total 2% with Strawberry Lowfat Greek Strained Yogurt} 0.001783414 0.3616692 48.77108 135
## [9] {Total 2% Lowfat Greek Strained Yogurt with Peach} => {Total 2% with Strawberry Lowfat Greek Strained Yogurt} 0.001730064 0.3524845 47.53251 125
## [10] {Zero Calorie Cola} => {Soda} 0.001036514 0.3919308 34.12399 1
In the top 10 rules the lift is very high, but the support of every rule is far below 1% (between roughly 0.1% and 0.2%). Baskets containing all the products of such a rule are therefore rare. A confidence of 0.3-0.4 means that 30-40% of the baskets containing the lhs products also contain the rhs product. Support and confidence are therefore very important here: the rules with the highest lift are rare combinations of products.
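Why the lift is so large follows from its definition, lift = confidence / supp(rhs): a rare rhs inflates the ratio. The sketch below backs out the rhs support of the first rule from the printed confidence and lift, and compares it with what the same confidence would give for a very common rhs such as ‘Banana’ (support 0.14271887 in the Eclat output). The helper name is hypothetical.

```r
# lift = confidence / supp(rhs), hence supp(rhs) = confidence / lift
implied_rhs_support <- function(confidence, lift) confidence / lift

# Rule [1]: {Strawberry Rhubarb Yoghurt} => {Blueberry Yoghurt}
rare_rhs <- implied_rhs_support(confidence = 0.3096647, lift = 80.29801)
print(round(rare_rhs, 5))   # Blueberry Yoghurt is in well under 1% of baskets

# The same confidence with a very common rhs gives a much smaller lift
banana_support <- 0.14271887
lift_if_banana <- 0.3096647 / banana_support
print(round(lift_if_banana, 2))
```

So a high lift says that the association is strong relative to the baseline, not that the combination is common.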
All the rules can be plotted in terms of support, confidence and lift.
plot(freq_rules, measure=c("support", "confidence"), shading="lift", interactive=FALSE)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
In the plot it is clearly visible that the majority of the rules have low support. As noted above, the rules with the highest lift have support far below 1%.
To get the rules that appear often in baskets, the rules are sorted by confidence and then by support, so the final ordering is by support.
inspect(head(sort(sort(freq_rules, by ="confidence"),by="support"),15))
## lhs rhs support confidence lift itemset
## [1] {Organic Hass Avocado} => {Bag of Organic Bananas} 0.018443857 0.3318250 2.812560 2560
## [2] {Organic Raspberries} => {Bag of Organic Bananas} 0.013566143 0.3209520 2.720400 2500
## [3] {Organic Raspberries} => {Organic Strawberries} 0.012727785 0.3011179 3.626710 2501
## [4] {Honeycrisp Apple} => {Banana} 0.009381978 0.3466629 2.428991 1996
## [5] {Organic Fuji Apple} => {Banana} 0.009221928 0.3715075 2.603072 1967
## [6] {Organic Lemon} => {Bag of Organic Bananas} 0.008132064 0.3044223 2.580293 2067
## [7] {Organic Large Extra Fancy Fuji Apple} => {Bag of Organic Bananas} 0.007415650 0.3365617 2.852709 1831
## [8] {Broccoli Crown} => {Banana} 0.007049821 0.3154843 2.210530 1858
## [9] {Cucumber Kirby} => {Banana} 0.005662721 0.3079155 2.157496 1299
## [10] {Organic Navel Orange} => {Bag of Organic Bananas} 0.005525536 0.3661616 3.103598 1068
## [11] {Blueberries} => {Banana} 0.005456943 0.3082221 2.159645 1261
## [12] {Organic Hass Avocado,
## Organic Strawberries} => {Bag of Organic Bananas} 0.005411214 0.4613385 3.910321 2558
## [13] {Apple Honeycrisp Organic} => {Bag of Organic Bananas} 0.005235921 0.3050622 2.585717 1277
## [14] {Organic Kiwi} => {Bag of Organic Bananas} 0.004984414 0.3478723 2.948578 1140
## [15] {Organic Raspberries,
## Organic Strawberries} => {Bag of Organic Bananas} 0.004946307 0.3886228 3.293980 2498
There are a few interesting rules. Most of them pair one organic product with another, which suggests that customers who buy one organic product tend to buy others. The most frequent items appear in most of the rules. The rules can also be plotted as a matrix, with the lhs on the \(x\) axis and the rhs on the \(y\) axis. The redder the rectangle, the higher the lift of the rule.
rules_for_plot = head(sort(sort(freq_rules, by ="confidence"),by="support"),15)
plot(rules_for_plot, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{Organic Hass Avocado,Organic Strawberries}"
## [2] "{Organic Raspberries,Organic Strawberries}"
## [3] "{Organic Raspberries}"
## [4] "{Organic Navel Orange}"
## [5] "{Organic Kiwi}"
## [6] "{Organic Large Extra Fancy Fuji Apple}"
## [7] "{Organic Hass Avocado}"
## [8] "{Organic Fuji Apple}"
## [9] "{Apple Honeycrisp Organic}"
## [10] "{Organic Lemon}"
## [11] "{Honeycrisp Apple}"
## [12] "{Broccoli Crown}"
## [13] "{Blueberries}"
## [14] "{Cucumber Kirby}"
## Itemsets in Consequent (RHS)
## [1] "{Banana}" "{Bag of Organic Bananas}"
## [3] "{Organic Strawberries}"
Another way to plot the rules and make them easier to analyze is the Parallel Coordinates Plot. It shows e.g. that if a client has ‘Organic Strawberries’ and ‘Organic Hass Avocado’ in the basket, they are likely to buy ‘Bag of Organic Bananas’.
plot(rules_for_plot, method="paracoord")
In this section, we analyze what leads people to buy the two most frequent items, ‘Banana’ and ‘Bag of Organic Bananas’, and also ‘Zero Calorie Cola’.
rules_banana<-apriori(data=trans, parameter=list(supp=0.0025,conf = 0.3),
appearance=list(default="lhs", rhs="Banana"), control=list(verbose=F))
inspect(sort(rules_banana, by='lift'))
## lhs rhs support confidence lift count
## [1] {Bartlett Pears} => {Banana} 0.003551586 0.3860812 2.705187 466
## [2] {Gala Apples} => {Banana} 0.002804686 0.3837331 2.688734 368
## [3] {Organic Fuji Apple} => {Banana} 0.009221928 0.3715075 2.603072 1210
## [4] {Large Lemon,Organic Avocado} => {Banana} 0.003635421 0.3535953 2.477565 477
## [5] {Organic Avocado,Organic Strawberries} => {Banana} 0.002888521 0.3483456 2.440782 379
## [6] {Honeycrisp Apple} => {Banana} 0.009381978 0.3466629 2.428991 1231
## [7] {Organic Avocado,Organic Baby Spinach} => {Banana} 0.003688771 0.3452211 2.418889 484
## [8] {Limes,Organic Avocado} => {Banana} 0.002682743 0.3394407 2.378387 352
## [9] {Large Lemon,Organic Strawberries} => {Banana} 0.002522693 0.3254671 2.280477 331
## [10] {Broccoli Crown} => {Banana} 0.007049821 0.3154843 2.210530 925
## [11] {Clementines, Bag} => {Banana} 0.003551586 0.3152909 2.209175 466
## [12] {Granny Smith Apples} => {Banana} 0.003307700 0.3147208 2.205180 434
## [13] {Blueberries} => {Banana} 0.005456943 0.3082221 2.159645 716
## [14] {Cucumber Kirby} => {Banana} 0.005662721 0.3079155 2.157496 743
is.significant(rules_banana, trans)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
is.superset(rules_banana)
## 14 x 14 sparse Matrix of class "ngCMatrix"
## [[ suppressing 14 column names '{Banana,Gala Apples}', '{Banana,Bartlett Pears}', '{Banana,Granny Smith Apples}' ... ]]
##
## {Banana,Gala Apples} | . . . . . . . . . . . . .
## {Banana,Bartlett Pears} . | . . . . . . . . . . . .
## {Banana,Granny Smith Apples} . . | . . . . . . . . . . .
## {Banana,Clementines, Bag} . . . | . . . . . . . . . .
## {Banana,Blueberries} . . . . | . . . . . . . . .
## {Banana,Cucumber Kirby} . . . . . | . . . . . . . .
## {Banana,Broccoli Crown} . . . . . . | . . . . . . .
## {Banana,Organic Fuji Apple} . . . . . . . | . . . . . .
## {Banana,Honeycrisp Apple} . . . . . . . . | . . . . .
## {Banana,Limes,Organic Avocado} . . . . . . . . . | . . . .
## {Banana,Large Lemon,Organic Avocado} . . . . . . . . . . | . . .
## {Banana,Organic Avocado,Organic Baby Spinach} . . . . . . . . . . . | . .
## {Banana,Organic Avocado,Organic Strawberries} . . . . . . . . . . . . | .
## {Banana,Large Lemon,Organic Strawberries} . . . . . . . . . . . . . |
# is.subset(rules_banana)
It is clearly visible that bananas have a high lift in combination with other fruits. Importantly, all the rules are significant (based on Fisher’s exact test) and no rule is a superset or subset of another. (There is no need to print both the superset and the subset matrix; both give the same information about any two rules.)
‘Banana’ is the most popular item and is mostly bought by people who generally buy fruit. The above rules can be plotted as a graph.
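To illustrate what such a significance test looks at, the sketch below rebuilds the 2x2 table for the rule {Bartlett Pears} => {Banana} from the printed counts and runs `fisher.test()` from base R. The number of baskets with Bartlett Pears is not printed anywhere, so it is derived here from the confidence, which is an assumption of this sketch. This is a single unadjusted test; `is.significant()` may additionally correct for multiple comparisons, so the two are not guaranteed to be identical.

```r
n      <- 131209  # all transactions
n_rhs  <- 18726   # baskets with Banana (Eclat output)
n_both <- 466     # baskets with Bartlett Pears and Banana (count column)

# Assumption: lhs count derived from confidence = n_both / n_lhs
n_lhs <- round(n_both / 0.3860812)

# Rows: Bartlett Pears yes/no, columns: Banana yes/no
tab <- matrix(
  c(n_both,         n_lhs - n_both,
    n_rhs - n_both, n - n_lhs - n_rhs + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(pears = c("yes", "no"), banana = c("yes", "no"))
)
print(tab)

# One-sided test: are the two products bought together more often than by chance?
ft <- fisher.test(tab, alternative = "greater")
print(ft$p.value < 0.01)
```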
plot(rules_banana, method="graph",control = list(cex=0.9))
rules_bag_banana<-apriori(data=trans, parameter=list(supp=0.0025,conf = 0.3),
appearance=list(default="lhs", rhs="Bag of Organic Bananas"), control=list(verbose=F))
inspect(sort(rules_bag_banana, by="lift"))
## lhs rhs support confidence lift count
## [1] {Organic Hass Avocado,Organic Raspberries} => {Bag of Organic Bananas} 0.004046978 0.5210991 4.416854 531
## [2] {Organic Hass Avocado,Organic Strawberries} => {Bag of Organic Bananas} 0.005411214 0.4613385 3.910321 710
## [3] {Organic Hass Avocado,Organic Lemon} => {Bag of Organic Bananas} 0.002690364 0.4519846 3.831037 353
## [4] {Organic Cucumber,Organic Hass Avocado} => {Bag of Organic Bananas} 0.002789443 0.4404332 3.733127 366
## [5] {Organic Cucumber,Organic Strawberries} => {Bag of Organic Bananas} 0.003231486 0.4108527 3.482401 424
## [6] {Organic Baby Spinach,Organic Hass Avocado} => {Bag of Organic Bananas} 0.003787850 0.3969649 3.364687 497
## [7] {Organic Raspberries,Organic Strawberries} => {Bag of Organic Bananas} 0.004946307 0.3886228 3.293980 649
## [8] {Organic Navel Orange} => {Bag of Organic Bananas} 0.005525536 0.3661616 3.103598 725
## [9] {Organic Baby Spinach,Organic Strawberries} => {Bag of Organic Bananas} 0.004473778 0.3581452 3.035651 587
## [10] {Organic Strawberries,Organic Whole Milk} => {Bag of Organic Bananas} 0.002583664 0.3553459 3.011924 339
## [11] {Organic Kiwi} => {Bag of Organic Bananas} 0.004984414 0.3478723 2.948578 654
## [12] {Organic Bartlett Pear} => {Bag of Organic Bananas} 0.002515071 0.3367347 2.854175 330
## [13] {Organic Large Extra Fancy Fuji Apple} => {Bag of Organic Bananas} 0.007415650 0.3365617 2.852709 973
## [14] {Organic Hass Avocado} => {Bag of Organic Bananas} 0.018443857 0.3318250 2.812560 2420
## [15] {Organic Raspberries} => {Bag of Organic Bananas} 0.013566143 0.3209520 2.720400 1780
## [16] {Frozen Organic Wild Blueberries} => {Bag of Organic Bananas} 0.002591286 0.3192488 2.705964 340
## [17] {Organic D'Anjou Pears} => {Bag of Organic Bananas} 0.004603343 0.3190703 2.704450 604
## [18] {Organic Whole Strawberries} => {Bag of Organic Bananas} 0.002941871 0.3182193 2.697237 386
## [19] {Organic Broccoli} => {Bag of Organic Bananas} 0.004001250 0.3176044 2.692025 525
## [20] {Apple Honeycrisp Organic} => {Bag of Organic Bananas} 0.005235921 0.3050622 2.585717 687
## [21] {Organic Lemon} => {Bag of Organic Bananas} 0.008132064 0.3044223 2.580293 1067
is.significant(rules_bag_banana, trans)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
is.superset(rules_bag_banana)
## 21 x 21 sparse Matrix of class "ngCMatrix"
## [[ suppressing 21 column names '{Bag of Organic Bananas,Organic Bartlett Pear}', '{Bag of Organic Bananas,Frozen Organic Wild Blueberries}', '{Bag of Organic Bananas,Organic Whole Strawberries}' ... ]]
##
## {Bag of Organic Bananas,Organic Bartlett Pear} | . . . . . . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Frozen Organic Wild Blueberries} . | . . . . . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Whole Strawberries} . . | . . . . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Broccoli} . . . | . . . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Navel Orange} . . . . | . . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Kiwi} . . . . . | . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic D'Anjou Pears} . . . . . . | . . . . . . . . . . . . . .
## {Apple Honeycrisp Organic,Bag of Organic Bananas} . . . . . . . | . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Large Extra Fancy Fuji Apple} . . . . . . . . | . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Lemon} . . . . . . . . . | . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Raspberries} . . . . . . . . . . | . . . . . . . . . .
## {Bag of Organic Bananas,Organic Hass Avocado} . . . . . . . . . . . | . . . . . . . . .
## {Bag of Organic Bananas,Organic Hass Avocado,Organic Lemon} . . . . . . . . . | . | | . . . . . . . .
## {Bag of Organic Bananas,Organic Strawberries,Organic Whole Milk} . . . . . . . . . . . . . | . . . . . . .
## {Bag of Organic Bananas,Organic Cucumber,Organic Hass Avocado} . . . . . . . . . . . | . . | . . . . . .
## {Bag of Organic Bananas,Organic Cucumber,Organic Strawberries} . . . . . . . . . . . . . . . | . . . . .
## {Bag of Organic Bananas,Organic Hass Avocado,Organic Raspberries} . . . . . . . . . . | | . . . . | . . . .
## {Bag of Organic Bananas,Organic Raspberries,Organic Strawberries} . . . . . . . . . . | . . . . . . | . . .
## {Bag of Organic Bananas,Organic Baby Spinach,Organic Hass Avocado} . . . . . . . . . . . | . . . . . . | . .
## {Bag of Organic Bananas,Organic Hass Avocado,Organic Strawberries} . . . . . . . . . . . | . . . . . . . | .
## {Bag of Organic Bananas,Organic Baby Spinach,Organic Strawberries} . . . . . . . . . . . . . . . . . . . . |
In this case, we are looking for products that lead to buying ‘Bag of Organic Bananas’. The products on the lhs are mostly organic (‘Organic Hass Avocado’, ‘Organic Raspberries’, ‘Organic Strawberries’) and are generally vegetables and fruits.
All the rules are significant and some of them are supersets/subsets of others.
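What "superset/subset" means for two rules can be checked by hand on their item sets. The sketch below uses two item sets that appear in the matrix above; `is_subset_of()` is a hypothetical helper written for this illustration, not an arules function.

```r
# Two item sets taken from the is.superset() output
pair   <- c("Bag of Organic Bananas", "Organic Hass Avocado")
triple <- c("Bag of Organic Bananas", "Organic Hass Avocado", "Organic Lemon")

# Hypothetical helper: TRUE if every item of `a` is also in `b`
is_subset_of <- function(a, b) all(a %in% b)

print(is_subset_of(pair, triple))  # the pair is contained in the triple
print(is_subset_of(triple, pair))  # the reverse does not hold
```

The larger rule is not redundant here: adding ‘Organic Lemon’ to the lhs raises the confidence from 0.332 to 0.452 in the table above.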
plot(rules_bag_banana, method="graph",control = list(cex=0.6))
The last product that will be analyzed is not as popular as the previous ones, but out of curiosity let’s check what leads to buying something less healthy: ‘Zero Calorie Cola’.
rules_cola<-apriori(data=trans, parameter=list(supp=0.0001,conf = 0.01),
appearance=list(default="lhs", rhs="Zero Calorie Cola"), control=list(verbose=F))
inspect(sort(rules_cola,by="lift"))
## lhs rhs support confidence lift count
## [1] {0% Greek Strained Yogurt,Soda} => {Zero Calorie Cola} 0.0001067000 0.23728814 89.724320 14
## [2] {Soda,Trail Mix} => {Zero Calorie Cola} 0.0001448071 0.21839080 82.578787 19
## [3] {Soda} => {Zero Calorie Cola} 0.0010365143 0.09024552 34.123990 136
## [4] {Trail Mix} => {Zero Calorie Cola} 0.0002515071 0.06191370 23.411049 33
## [5] {Milk Chocolate Almonds} => {Zero Calorie Cola} 0.0001143214 0.05791506 21.899069 15
## [6] {Popcorn} => {Zero Calorie Cola} 0.0001143214 0.05494505 20.776040 15
## [7] {Crunchy Oats 'n Honey Granola Bars} => {Zero Calorie Cola} 0.0001371857 0.05013928 18.958859 18
## [8] {Mineral Water} => {Zero Calorie Cola} 0.0001295643 0.04941860 18.686356 17
## [9] {Apples} => {Zero Calorie Cola} 0.0001143214 0.04870130 18.415126 15
## [10] {0% Greek Strained Yogurt} => {Zero Calorie Cola} 0.0001371857 0.04358354 16.479977 18
## [11] {Mixed Fruit Fruit Snacks} => {Zero Calorie Cola} 0.0001600500 0.04294479 16.238451 21
## [12] {Sparkling Mineral Water} => {Zero Calorie Cola} 0.0001448071 0.02769679 10.472820 19
## [13] {Sparkling Water} => {Zero Calorie Cola} 0.0001219429 0.02377415 8.989573 16
## [14] {Clementines} => {Zero Calorie Cola} 0.0002057786 0.01998520 7.556881 27
## [15] {Hass Avocados} => {Zero Calorie Cola} 0.0001752929 0.01010545 3.821112 23
is.significant(rules_cola, trans)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
is.superset(rules_cola)
## 15 x 15 sparse Matrix of class "ngCMatrix"
## [[ suppressing 15 column names '{Popcorn,Zero Calorie Cola}', '{Milk Chocolate Almonds,Zero Calorie Cola}', '{Mineral Water,Zero Calorie Cola}' ... ]]
##
## {Popcorn,Zero Calorie Cola} | . . . . . . . . . . . . . .
## {Milk Chocolate Almonds,Zero Calorie Cola} . | . . . . . . . . . . . . .
## {Mineral Water,Zero Calorie Cola} . . | . . . . . . . . . . . .
## {Apples,Zero Calorie Cola} . . . | . . . . . . . . . . .
## {Crunchy Oats 'n Honey Granola Bars,Zero Calorie Cola} . . . . | . . . . . . . . . .
## {0% Greek Strained Yogurt,Zero Calorie Cola} . . . . . | . . . . . . . . .
## {Trail Mix,Zero Calorie Cola} . . . . . . | . . . . . . . .
## {Sparkling Water,Zero Calorie Cola} . . . . . . . | . . . . . . .
## {Mixed Fruit Fruit Snacks,Zero Calorie Cola} . . . . . . . . | . . . . . .
## {Sparkling Mineral Water,Zero Calorie Cola} . . . . . . . . . | . . . . .
## {Clementines,Zero Calorie Cola} . . . . . . . . . . | . . . .
## {Soda,Zero Calorie Cola} . . . . . . . . . . . | . . .
## {Hass Avocados,Zero Calorie Cola} . . . . . . . . . . . . | . .
## {0% Greek Strained Yogurt,Soda,Zero Calorie Cola} . . . . . | . . . . . | . | .
## {Soda,Trail Mix,Zero Calorie Cola} . . . . . . | . . . . | . . |
The items most frequently bought with ‘Zero Calorie Cola’ are ‘Soda’ and ‘Trail Mix’, that is, snacks and generally less healthy food. The basket consisting of ‘0% Greek Strained Yogurt’, ‘Soda’ and ‘Zero Calorie Cola’ is not frequent (14 orders), but its lift is very high (89.7).
Moreover, all the rules are significant and some of the rules are the supersets/subsets of others.
plot(rules_cola, method="graph",control = list(cex=0.7))
The basic measures connected with Market Basket Analysis (support, confidence, lift) were calculated and the results were shown on graphs. In addition, there are other measures that give a deeper knowledge of the data, such as the Jaccard index and affinity.
These two measures will be calculated for the most frequent items.
The Jaccard index shows how likely it is that two products will be bought together. It can be written as the following equations.
Jaccard coefficient (similarity): \[J(X,Y) = \frac{|X\cap Y|}{|X\cup Y|}\]
Jaccard distance (dissimilarity) is \(1 - J(X,Y)\):
\[ d_j(X,Y) = 1 - J(X,Y) = \frac{|X\cup Y|-|X\cap Y|}{|X\cup Y|}\]
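A small worked example may help. Here \(X\) and \(Y\) are taken to be the sets of baskets that contain each of two products; the basket ids below are made up. The sketch also shows that the same coefficient can be computed from supports, because dividing numerator and denominator by the number of baskets does not change the ratio.

```r
# Hypothetical basket ids containing each product (5 baskets in total)
baskets_x <- c(1, 2, 3, 5)  # baskets with product X
baskets_y <- c(1, 2, 4, 5)  # baskets with product Y
n_baskets <- 5

# Set-based definition
jaccard_coef <- length(intersect(baskets_x, baskets_y)) /
                length(union(baskets_x, baskets_y))
jaccard_dist <- 1 - jaccard_coef

# Equivalent form in terms of supports
supp_x  <- length(baskets_x) / n_baskets
supp_y  <- length(baskets_y) / n_baskets
supp_xy <- length(intersect(baskets_x, baskets_y)) / n_baskets
from_supports <- supp_xy / (supp_x + supp_y - supp_xy)

print(c(coefficient = jaccard_coef, distance = jaccard_dist,
        from_supports = from_supports))
```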
trans.sel<-trans[,itemFrequency(trans)>0.06]
jac<-dissimilarity(trans.sel, which="items")
round(jac,digits=3)
## Bag of Organic Bananas Banana Large Lemon Organic Baby Spinach
## Banana 0.999
## Large Lemon 0.953 0.913
## Organic Baby Spinach 0.903 0.925 0.926
## Organic Strawberries 0.868 0.921 0.944 0.914
The results show that ‘Banana’ and ‘Bag of Organic Bananas’ almost never appear in the same basket (Jaccard distance 0.999), so they are close to maximally dissimilar. This is logical: the two products are substitutes, so there is little reason to buy them together.
What is more, a dendrogram can be built from the Jaccard distances. In this way, the most similar pairs of products are easily visible.
plot(hclust(jac, method = "ward.D2"), main = "Dendrogram for items")
In contrast to the Jaccard distance, affinity is a measure of the similarity of two items and can be represented as:
\[A(i,j) = \frac{supp(i, j)}{supp(i)+supp(j)-supp(i, j)}\]
The higher the value, the more likely two products will be bought together.
a = affinity(trans.sel)
round(a, digits=3)
## An object of class "ar_similarity"
## Bag of Organic Bananas Banana Large Lemon Organic Baby Spinach Organic Strawberries
## Bag of Organic Bananas 0.000 0.001 0.047 0.097 0.132
## Banana 0.001 0.000 0.087 0.075 0.079
## Large Lemon 0.047 0.087 0.000 0.074 0.056
## Organic Baby Spinach 0.097 0.075 0.074 0.000 0.086
## Organic Strawberries 0.132 0.079 0.056 0.086 0.000
## Slot "method":
## [1] "Affinity"
par(mar=c(4,8,4,4))
image(a, axes=FALSE)
axis(1,at=seq(0,1,l=ncol(a)),labels=rownames(a),cex.axis=0.5)
axis(2,at=seq(0,1,l=ncol(a)),labels=rownames(a),cex.axis=0.6, las=2)
The results are as expected: mostly low values. They can also be plotted, but compared with the printed matrix the plot is flipped, so the diagonal of zeros runs from bottom left to top right instead of from top left to bottom right.
The red rectangles indicate that the value of the affinity measure is very low. ‘Banana’ and ‘Bag of Organic Bananas’ again have the lowest value, and comparing the two tables shows that affinity is basically \(1 - d_j\), i.e. one minus the Jaccard distance.
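The relationship between the two tables can be verified directly from the rounded values printed above. The vectors below are copied by hand from the Jaccard distance matrix and the affinity matrix, pair by pair; the object names are made up for this check.

```r
item_pairs <- c(
  "Bag of Organic Bananas / Banana",
  "Bag of Organic Bananas / Large Lemon",
  "Bag of Organic Bananas / Organic Baby Spinach",
  "Bag of Organic Bananas / Organic Strawberries",
  "Banana / Large Lemon",
  "Banana / Organic Baby Spinach",
  "Banana / Organic Strawberries",
  "Large Lemon / Organic Baby Spinach",
  "Large Lemon / Organic Strawberries",
  "Organic Baby Spinach / Organic Strawberries"
)

# Values as printed in the two outputs (rounded to 3 digits)
jaccard_distance <- c(0.999, 0.953, 0.903, 0.868, 0.913,
                      0.925, 0.921, 0.926, 0.944, 0.914)
affinity_value   <- c(0.001, 0.047, 0.097, 0.132, 0.087,
                      0.075, 0.079, 0.074, 0.056, 0.086)

check <- data.frame(
  pair             = item_pairs,
  one_minus_dist   = 1 - jaccard_distance,
  affinity         = affinity_value
)
print(check)

# TRUE if the two columns agree for every pair
print(all(abs(check$one_minus_dist - check$affinity) < 1e-9))
```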
To sum up, the analysis above can be used for better placement of products. At the beginning, the strongest rules were discovered, but those baskets were rather unusual cases. The analysis was then performed on three products:
‘Bananas’ and ‘Bag of Organic Bananas’ are the two most frequent items. The rules show that those two products are mostly bought with other organic vegetables or fruits. This shows that those products should be placed very close to other fruits and vegetables in the shop.
‘Zero Calorie Cola’ was mostly bought with some kind of snack. This suggests that the shop could offer e.g. bundles of cola and crisps in order to sell more of those products.
Market Basket Analysis is a powerful tool to get better knowledge about customers’ behaviour. It can help shops to increase cross-sell and be more profitable.