The aim of this paper is to apply Market Basket Analysis (MBA), an association rule technique, to a case study of grocery shop retail data. This unsupervised learning method reveals the patterns in which items (products) are linked with one another. The “if - then” rule describes the common behaviour of the decision maker: for instance, if someone bought bread, then he/she is, with some likelihood, also going to buy butter. This example is hardly “rocket science”; however, MBA gives a deeper insight into the other dependencies between purchased products. Therefore, MBA is a suitable way to find the strongest dependencies in product purchase behaviour.
The data comes from https://www.kaggle.com/irfanasrullah/groceries and, when read as a plain CSV, has 9834 rows and 32 columns. The dataset contains the transactions and the products purchased in each transaction. Note that the read.csv call below uses the first basket as the header row, which is why it reports 9834 rows while read.transactions reports 9835 transactions.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(arulesCBA)
##
## Attaching package: 'arulesCBA'
## The following object is masked from 'package:arules':
##
## rules
setwd("C:/Users/Mateusz/Documents/Studia/Master/UW/DS WNE/Unsupervised Learning/projekt UL")
gro <- read.csv("gro_csv.csv", sep = ",", header=T, skip=1)
dim(gro)
## [1] 9834 32
head(gro)
## citrus.fruit semi.finished.bread margarine ready.soups
## 1 tropical fruit yogurt coffee
## 2 whole milk
## 3 pip fruit yogurt cream cheese meat spreads
## 4 other vegetables whole milk condensed milk long life bakery product
## 5 whole milk butter yogurt rice
## 6 rolls/buns
## X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8 X.9 X.10 X.11 X.12 X.13 X.14
## 1
## 2
## 3
## 4
## 5 abrasive cleaner
## 6
## X.15 X.16 X.17 X.18 X.19 X.20 X.21 X.22 X.23 X.24 X.25 X.26 X.27
## 1
## 2
## 3
## 4
## 5
## 6
summary(gro)
## citrus.fruit semi.finished.bread margarine ready.soups
## Length:9834 Length:9834 Length:9834 Length:9834
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X X.1 X.2 X.3
## Length:9834 Length:9834 Length:9834 Length:9834
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.4 X.5 X.6 X.7
## Length:9834 Length:9834 Length:9834 Length:9834
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.8 X.9 X.10 X.11
## Length:9834 Length:9834 Length:9834 Length:9834
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.12 X.13 X.14 X.15
## Length:9834 Length:9834 Length:9834 Length:9834
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.16 X.17 X.18 X.19
## Length:9834 Length:9834 Length:9834 Length:9834
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.20 X.21 X.22 X.23
## Length:9834 Length:9834 Length:9834 Length:9834
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X.24 X.25 X.26 X.27
## Length:9834 Length:9834 Length:9834 Length:9834
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
trans <- read.transactions("gro_csv.csv", format = "basket", sep=",", skip=1)
summary(trans)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
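The density reported in the summary can be tied back to the other figures with a little arithmetic. The sketch below uses only numbers printed above; the variable names are illustrative and not part of the original analysis.

```r
# Figures copied from the summary(trans) output above
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146   # share of non-empty cells in the item matrix

# Average basket size: density times the number of distinct items
mean_basket <- density * n_items

# Total number of purchased-item entries in the whole dataset
total_entries <- density * n_transactions * n_items

# The same total, summed from the "most frequent items" block
total_from_counts <- 2513 + 1903 + 1809 + 1715 + 1372 + 34055

print(round(mean_basket, 3))
print(round(total_entries))
print(total_from_counts)
```

The average basket size obtained this way agrees with the mean of the length distribution reported in the summary.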
round(itemFrequency(trans),3)
## abrasive cleaner artif. sweetener baby cosmetics
## 0.004 0.003 0.001
## baby food bags baking powder
## 0.000 0.000 0.018
## bathroom cleaner beef berries
## 0.003 0.052 0.033
## beverages bottled beer bottled water
## 0.026 0.081 0.111
## brandy brown bread butter
## 0.004 0.065 0.055
## butter milk cake bar candles
## 0.028 0.013 0.009
## candy canned beer canned fish
## 0.030 0.078 0.015
## canned fruit canned vegetables cat food
## 0.003 0.011 0.023
## cereals chewing gum chicken
## 0.006 0.021 0.043
## chocolate chocolate marshmallow citrus fruit
## 0.050 0.009 0.083
## cleaner cling film/bags cocoa drinks
## 0.005 0.011 0.002
## coffee condensed milk cooking chocolate
## 0.058 0.010 0.003
## cookware cream cream cheese
## 0.003 0.001 0.040
## curd curd cheese decalcifier
## 0.053 0.005 0.002
## dental care dessert detergent
## 0.006 0.037 0.019
## dish cleaner dishes dog food
## 0.010 0.018 0.009
## domestic eggs female sanitary products finished products
## 0.063 0.006 0.007
## fish flour flower (seeds)
## 0.003 0.017 0.010
## flower soil/fertilizer frankfurter frozen chicken
## 0.002 0.059 0.001
## frozen dessert frozen fish frozen fruits
## 0.011 0.012 0.001
## frozen meals frozen potato products frozen vegetables
## 0.028 0.008 0.048
## fruit/vegetable juice grapes hair spray
## 0.072 0.022 0.001
## ham hamburger meat hard cheese
## 0.026 0.033 0.025
## herbs honey house keeping products
## 0.016 0.002 0.008
## hygiene articles ice cream instant coffee
## 0.033 0.025 0.007
## Instant food products jam ketchup
## 0.008 0.005 0.004
## kitchen towels kitchen utensil light bulbs
## 0.006 0.000 0.004
## liqueur liquor liquor (appetizer)
## 0.001 0.011 0.008
## liver loaf long life bakery product make up remover
## 0.005 0.037 0.001
## male cosmetics margarine mayonnaise
## 0.005 0.059 0.009
## meat meat spreads misc. beverages
## 0.026 0.004 0.028
## mustard napkins newspapers
## 0.012 0.052 0.080
## nut snack nuts/prunes oil
## 0.003 0.003 0.028
## onions organic products organic sausage
## 0.031 0.002 0.002
## other vegetables packaged fruit/vegetables pasta
## 0.193 0.013 0.015
## pastry pet care photo/film
## 0.089 0.009 0.009
## pickled vegetables pip fruit popcorn
## 0.018 0.076 0.007
## pork potato products potted plants
## 0.058 0.003 0.017
## preservation products processed cheese prosecco
## 0.000 0.017 0.002
## pudding powder ready soups red/blush wine
## 0.002 0.002 0.019
## rice roll products rolls/buns
## 0.008 0.010 0.184
## root vegetables rubbing alcohol rum
## 0.109 0.001 0.004
## salad dressing salt salty snack
## 0.001 0.011 0.038
## sauces sausage seasonal products
## 0.005 0.094 0.014
## semi-finished bread shopping bags skin care
## 0.018 0.099 0.004
## sliced cheese snack products soap
## 0.025 0.003 0.003
## soda soft cheese softener
## 0.174 0.017 0.005
## sound storage medium soups sparkling wine
## 0.000 0.007 0.006
## specialty bar specialty cheese specialty chocolate
## 0.027 0.009 0.030
## specialty fat specialty vegetables spices
## 0.004 0.002 0.005
## spread cheese sugar sweet spreads
## 0.011 0.034 0.009
## syrup tea tidbits
## 0.003 0.004 0.002
## toilet cleaner tropical fruit turkey
## 0.001 0.105 0.008
## UHT-milk vinegar waffles
## 0.033 0.007 0.038
## whipped/sour cream whisky white bread
## 0.072 0.001 0.042
## white wine whole milk yogurt
## 0.019 0.256 0.140
## zwieback
## 0.007
itemFrequency(trans, type="absolute")
## abrasive cleaner artif. sweetener baby cosmetics
## 35 32 6
## baby food bags baking powder
## 1 4 174
## bathroom cleaner beef berries
## 27 516 327
## beverages bottled beer bottled water
## 256 792 1087
## brandy brown bread butter
## 41 638 545
## butter milk cake bar candles
## 275 130 88
## candy canned beer canned fish
## 294 764 148
## canned fruit canned vegetables cat food
## 32 106 229
## cereals chewing gum chicken
## 56 207 422
## chocolate chocolate marshmallow citrus fruit
## 488 89 814
## cleaner cling film/bags cocoa drinks
## 50 112 22
## coffee condensed milk cooking chocolate
## 571 101 25
## cookware cream cream cheese
## 27 13 390
## curd curd cheese decalcifier
## 524 50 15
## dental care dessert detergent
## 57 365 189
## dish cleaner dishes dog food
## 103 173 84
## domestic eggs female sanitary products finished products
## 624 60 64
## fish flour flower (seeds)
## 29 171 102
## flower soil/fertilizer frankfurter frozen chicken
## 19 580 6
## frozen dessert frozen fish frozen fruits
## 106 115 12
## frozen meals frozen potato products frozen vegetables
## 279 83 473
## fruit/vegetable juice grapes hair spray
## 711 220 11
## ham hamburger meat hard cheese
## 256 327 241
## herbs honey house keeping products
## 160 15 82
## hygiene articles ice cream instant coffee
## 324 246 73
## Instant food products jam ketchup
## 79 53 42
## kitchen towels kitchen utensil light bulbs
## 59 4 41
## liqueur liquor liquor (appetizer)
## 9 109 78
## liver loaf long life bakery product make up remover
## 50 368 8
## male cosmetics margarine mayonnaise
## 45 576 90
## meat meat spreads misc. beverages
## 254 42 279
## mustard napkins newspapers
## 118 515 785
## nut snack nuts/prunes oil
## 31 33 276
## onions organic products organic sausage
## 305 16 22
## other vegetables packaged fruit/vegetables pasta
## 1903 128 148
## pastry pet care photo/film
## 875 93 91
## pickled vegetables pip fruit popcorn
## 176 744 71
## pork potato products potted plants
## 567 28 170
## preservation products processed cheese prosecco
## 2 163 20
## pudding powder ready soups red/blush wine
## 23 18 189
## rice roll products rolls/buns
## 75 101 1809
## root vegetables rubbing alcohol rum
## 1072 10 44
## salad dressing salt salty snack
## 8 106 372
## sauces sausage seasonal products
## 54 924 140
## semi-finished bread shopping bags skin care
## 174 969 35
## sliced cheese snack products soap
## 241 30 26
## soda soft cheese softener
## 1715 168 54
## sound storage medium soups sparkling wine
## 1 67 55
## specialty bar specialty cheese specialty chocolate
## 269 84 299
## specialty fat specialty vegetables spices
## 36 17 51
## spread cheese sugar sweet spreads
## 110 333 89
## syrup tea tidbits
## 32 38 23
## toilet cleaner tropical fruit turkey
## 7 1032 80
## UHT-milk vinegar waffles
## 329 64 378
## whipped/sour cream whisky white bread
## 705 8 414
## white wine whole milk yogurt
## 187 2513 1372
## zwieback
## 68
The table above presents the purchase frequency of each product in the dataset.
itemFrequencyPlot(trans, topN=25, type="relative", main="Item Frequency")
The chart above presents the share of all transactions in which each of the 25 most frequent products appears.
rules.trans <- apriori(trans, parameter=list(supp=0.01, conf=0.25))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [171 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
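Understanding the measures used in Market Basket Analysis is crucial for handling and interpreting the data correctly. For a rule {A} => {B}, the measures reported below are: Support, the share of all transactions that contain both A and B; Confidence, the share of transactions containing A that also contain B; Count, the number of transactions that contain both A and B; and Lift, the ratio of the observed support to the support expected if A and B were independent.
In general, the higher the value of each index, the stronger the rule; a lift above 1 means that the items occur together more often than independence would predict.
Support, Confidence, Count, and Lift are calculated below. An example of their interpretation, proposed by Hahsler, M., & Karpienko, R. (2017), is as follows: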
“For example, let us assume that we find the rule {milk, bread} -> {butter} with support of 0.2, confidence of 0.9 and lift of 2. Now we know that 20 % of all transactions contain all three items together, the estimated conditional probability of seeing butter in a transaction under the condition that the transaction also contains milk and bread is 0.9, and we saw the items together in transactions at double the rate we would expect under independence between the itemsets {milk, bread} and {butter}”
Therefore, Market Basket Analysis lets us find the strongest interdependencies between purchased products. Such an analysis might help sellers adjust the way in which products are presented in shops (including online shops), on the one hand to make buyers' lives easier, and on the other hand to encourage them to buy commonly interrelated products more frequently.
rules.by.supp<-sort(rules.trans, by="support", decreasing=TRUE)
inspect(head(rules.by.supp))
## lhs rhs support confidence coverage
## [1] {} => {whole milk} 0.25551601 0.2555160 1.0000000
## [2] {other vegetables} => {whole milk} 0.07483477 0.3867578 0.1934926
## [3] {whole milk} => {other vegetables} 0.07483477 0.2928770 0.2555160
## [4] {rolls/buns} => {whole milk} 0.05663447 0.3079049 0.1839349
## [5] {yogurt} => {whole milk} 0.05602440 0.4016035 0.1395018
## [6] {root vegetables} => {whole milk} 0.04890696 0.4486940 0.1089985
## lift count
## [1] 1.000000 2513
## [2] 1.513634 736
## [3] 1.513634 736
## [4] 1.205032 557
## [5] 1.571735 551
## [6] 1.756031 481
In the table above, the highest support belongs to {} => {whole milk}, a rule with an empty antecedent. It should be interpreted as follows: 26% of all transactions contain whole milk, regardless of what else is in the basket, which is why its confidence equals its support and its lift is 1.
However, a much more interesting interpretation is that of the second highest support value, for {other vegetables} => {whole milk}. The numbers should be interpreted as follows: 7% of all transactions contain both items together, the estimated conditional probability of seeing whole milk in a transaction that also contains other vegetables is 0.39, and the items appear together in transactions at 1.5 times the rate we would expect under independence between the itemsets {other vegetables} and {whole milk}.
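The figures quoted for {other vegetables} => {whole milk} can be reproduced by hand from the absolute counts printed earlier. The following is a minimal sketch using base R only; the variable names are illustrative, and the counts are copied from the itemFrequency and inspect outputs above.

```r
# Counts taken from the outputs above
n        <- 9835   # number of transactions
count_xy <- 736    # baskets with both other vegetables and whole milk
count_x  <- 1903   # baskets with other vegetables (antecedent)
count_y  <- 2513   # baskets with whole milk (consequent)

# Support: share of all baskets that contain both items
support <- count_xy / n

# Confidence: share of antecedent baskets that also contain the consequent
confidence <- count_xy / count_x

# Lift: confidence relative to the overall share of the consequent
lift <- confidence / (count_y / n)

print(round(c(support = support, confidence = confidence, lift = lift), 6))
```

The three values agree with the support, confidence and lift columns of the second row of the table.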
rules.by.conf<-sort(rules.trans, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))
## lhs rhs support
## [1] {citrus fruit, root vegetables} => {other vegetables} 0.01037112
## [2] {root vegetables, tropical fruit} => {other vegetables} 0.01230300
## [3] {curd, yogurt} => {whole milk} 0.01006609
## [4] {butter, other vegetables} => {whole milk} 0.01148958
## [5] {root vegetables, tropical fruit} => {whole milk} 0.01199797
## [6] {root vegetables, yogurt} => {whole milk} 0.01453991
## confidence coverage lift count
## [1] 0.5862069 0.01769192 3.029608 102
## [2] 0.5845411 0.02104728 3.020999 121
## [3] 0.5823529 0.01728521 2.279125 99
## [4] 0.5736041 0.02003050 2.244885 113
## [5] 0.5700483 0.02104728 2.230969 118
## [6] 0.5629921 0.02582613 2.203354 143
rules.by.count<-sort(rules.trans, by="count", decreasing=TRUE)
inspect(head(rules.by.count))
## lhs rhs support confidence coverage
## [1] {} => {whole milk} 0.25551601 0.2555160 1.0000000
## [2] {other vegetables} => {whole milk} 0.07483477 0.3867578 0.1934926
## [3] {whole milk} => {other vegetables} 0.07483477 0.2928770 0.2555160
## [4] {rolls/buns} => {whole milk} 0.05663447 0.3079049 0.1839349
## [5] {yogurt} => {whole milk} 0.05602440 0.4016035 0.1395018
## [6] {root vegetables} => {whole milk} 0.04890696 0.4486940 0.1089985
## lift count
## [1] 1.000000 2513
## [2] 1.513634 736
## [3] 1.513634 736
## [4] 1.205032 557
## [5] 1.571735 551
## [6] 1.756031 481
rules.by.lift<-sort(rules.trans, by="lift", decreasing=TRUE)
inspect(head(rules.by.lift))
## lhs rhs support
## [1] {citrus fruit, other vegetables} => {root vegetables} 0.01037112
## [2] {other vegetables, tropical fruit} => {root vegetables} 0.01230300
## [3] {beef} => {root vegetables} 0.01738688
## [4] {citrus fruit, root vegetables} => {other vegetables} 0.01037112
## [5] {root vegetables, tropical fruit} => {other vegetables} 0.01230300
## [6] {other vegetables, whole milk} => {root vegetables} 0.02318251
## confidence coverage lift count
## [1] 0.3591549 0.02887646 3.295045 102
## [2] 0.3427762 0.03589222 3.144780 121
## [3] 0.3313953 0.05246568 3.040367 171
## [4] 0.5862069 0.01769192 3.029608 102
## [5] 0.5845411 0.02104728 3.020999 121
## [6] 0.3097826 0.07483477 2.842082 228
In the table above, the highest lift belongs to {citrus fruit, other vegetables} => {root vegetables}. The numbers should be interpreted as follows: 1% of all transactions contain all three items together, the estimated conditional probability of seeing root vegetables in a transaction that also contains citrus fruit and other vegetables is 0.36, and the items appear together in transactions at about 3.3 times the rate we would expect under independence between the itemsets {citrus fruit, other vegetables} and {root vegetables}.
The third outcome is interesting but also logical: the connection between {beef} and {root vegetables} seems natural.
It is worth mentioning that vegetables in general (both root vegetables and other vegetables) might be understood as a central object of purchases, even though whole milk was the most frequently chosen product. However, it is worth remembering that milk is usually consumed at breakfast or supper (with cereals or coffee), that is, during the “smaller meals”, whereas vegetables are usually eaten for lunch and dinner, which usually consist of many other products. Therefore, as we were looking for the most common association rules in product purchase behaviour, I would recommend that the manager of the shop in which the data was gathered place vegetables centrally in the shopping area.
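One way to read the lift of this rule is to compare the observed number of baskets with the number expected under independence. The sketch below is base R arithmetic on values copied from the outputs above; the variable names are illustrative.

```r
# Values copied from the outputs above
n        <- 9835         # number of transactions
coverage <- 0.02887646   # share of baskets with citrus fruit and other vegetables
count_y  <- 1072         # baskets with root vegetables
count_xy <- 102          # baskets with all three items

# Number of baskets that contain the antecedent
count_x <- round(coverage * n)

# Baskets we would expect to contain all three items if the antecedent
# and the consequent were independent
expected_xy <- count_x * count_y / n

# Lift is the ratio of observed to expected co-occurrences
lift <- count_xy / expected_xy

print(count_x)
print(round(expected_xy, 2))
print(round(lift, 4))
```

Under independence roughly 31 baskets would contain all three items, whereas 102 were observed, which is where the lift of about 3.3 comes from.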
Eclat and Apriori are the most popular algorithms for Market Basket Analysis. Here I will focus on the Eclat algorithm.
sets<-eclat(trans, parameter = list(supp=0.05, maxlen=20))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.05 1 20 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 491
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [28 item(s)] done [0.02s].
## creating sparse bit matrix ... [28 row(s), 9835 column(s)] done [0.00s].
## writing ... [31 set(s)] done [0.00s].
## Creating S4 object ... done [0.01s].
rules_eclat<-ruleInduction(sets, trans, confidence=0.9)
inspect(sets)
## items support count
## [1] {whole milk, yogurt} 0.05602440 551
## [2] {rolls/buns, whole milk} 0.05663447 557
## [3] {other vegetables, whole milk} 0.07483477 736
## [4] {whole milk} 0.25551601 2513
## [5] {other vegetables} 0.19349263 1903
## [6] {rolls/buns} 0.18393493 1809
## [7] {yogurt} 0.13950178 1372
## [8] {soda} 0.17437722 1715
## [9] {root vegetables} 0.10899847 1072
## [10] {tropical fruit} 0.10493137 1032
## [11] {bottled water} 0.11052364 1087
## [12] {sausage} 0.09395018 924
## [13] {shopping bags} 0.09852567 969
## [14] {citrus fruit} 0.08276563 814
## [15] {pastry} 0.08896797 875
## [16] {pip fruit} 0.07564820 744
## [17] {whipped/sour cream} 0.07168277 705
## [18] {fruit/vegetable juice} 0.07229283 711
## [19] {domestic eggs} 0.06344687 624
## [20] {newspapers} 0.07981698 785
## [21] {butter} 0.05541434 545
## [22] {margarine} 0.05856634 576
## [23] {brown bread} 0.06487036 638
## [24] {bottled beer} 0.08052872 792
## [25] {frankfurter} 0.05897306 580
## [26] {pork} 0.05765125 567
## [27] {napkins} 0.05236401 515
## [28] {curd} 0.05327911 524
## [29] {beef} 0.05246568 516
## [30] {coffee} 0.05805796 571
## [31] {canned beer} 0.07768175 764
freq.rules <- ruleInduction(sets, trans, confidence=0.05)
freq.rules
## set of 6 rules
inspect(freq.rules)
## lhs rhs support confidence lift
## [1] {yogurt} => {whole milk} 0.05602440 0.4016035 1.571735
## [2] {whole milk} => {yogurt} 0.05602440 0.2192598 1.571735
## [3] {whole milk} => {rolls/buns} 0.05663447 0.2216474 1.205032
## [4] {rolls/buns} => {whole milk} 0.05663447 0.3079049 1.205032
## [5] {whole milk} => {other vegetables} 0.07483477 0.2928770 1.513634
## [6] {other vegetables} => {whole milk} 0.07483477 0.3867578 1.513634
## itemset
## [1] 1
## [2] 1
## [3] 2
## [4] 2
## [5] 3
## [6] 3
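For intuition, Eclat works on a vertical data layout: each item is mapped to the list of transaction ids that contain it, and the support of an itemset is obtained by intersecting those lists. The sketch below illustrates this idea in base R on a handful of hypothetical baskets; it is not the implementation used by arules, and all names in it are made up for the example.

```r
# Toy baskets (hypothetical data, for illustration only)
baskets <- list(
  t1 = c("whole milk", "yogurt"),
  t2 = c("whole milk", "rolls/buns"),
  t3 = c("whole milk", "yogurt", "rolls/buns"),
  t4 = c("yogurt"),
  t5 = c("whole milk")
)
n <- length(baskets)

# Vertical layout: for each item, the ids of the baskets that contain it
items <- sort(unique(unlist(baskets)))
tidlists <- lapply(setNames(items, items), function(item) {
  which(vapply(baskets, function(b) item %in% b, logical(1)))
})

# Support of an itemset: size of the intersection of its tid-lists over n
itemset_support <- function(itemset) {
  tids <- Reduce(intersect, tidlists[itemset])
  length(tids) / n
}

supp_milk        <- itemset_support("whole milk")
supp_milk_yogurt <- itemset_support(c("whole milk", "yogurt"))

print(supp_milk)
print(supp_milk_yogurt)
```

Itemsets whose intersected list falls below the minimum support are dropped, so longer candidates are only built from itemsets that are already frequent.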
Here comes the Apriori Algorithm.
apr.rules<-apriori(trans, parameter=list(supp=0.001, conf=0.1, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [32783 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules.by.conf<-sort(apr.rules, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))
## lhs rhs support confidence coverage lift count
## [1] {rice,
## sugar} => {whole milk} 0.001220132 1 0.001220132 3.913649 12
## [2] {canned fish,
## hygiene articles} => {whole milk} 0.001118454 1 0.001118454 3.913649 11
## [3] {butter,
## rice,
## root vegetables} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [4] {flour,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001728521 1 0.001728521 3.913649 17
## [5] {butter,
## domestic eggs,
## soft cheese} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
## [6] {citrus fruit,
## root vegetables,
## soft cheese} => {other vegetables} 0.001016777 1 0.001016777 5.168156 10
Let’s now see which products are purchased when the consequent is whole milk, the most frequently purchased product.
rules.whole.milk<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.5),
appearance=list(default="lhs", rhs ="whole milk"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [2679 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules.whole.milk.bysupp<-sort(rules.whole.milk, by="support", decreasing=TRUE)
inspect(head(rules.whole.milk.bysupp))
## lhs rhs support
## [1] {other vegetables, yogurt} => {whole milk} 0.02226741
## [2] {tropical fruit, yogurt} => {whole milk} 0.01514997
## [3] {other vegetables, whipped/sour cream} => {whole milk} 0.01464159
## [4] {root vegetables, yogurt} => {whole milk} 0.01453991
## [5] {other vegetables, pip fruit} => {whole milk} 0.01352313
## [6] {rolls/buns, root vegetables} => {whole milk} 0.01270971
## confidence coverage lift count
## [1] 0.5128806 0.04341637 2.007235 219
## [2] 0.5173611 0.02928317 2.024770 149
## [3] 0.5070423 0.02887646 1.984385 144
## [4] 0.5629921 0.02582613 2.203354 143
## [5] 0.5175097 0.02613116 2.025351 133
## [6] 0.5230126 0.02430097 2.046888 125
The Apriori rules indicate that when the consequent is whole milk, the most common antecedent is other vegetables together with yogurt, with a support of 2%, a confidence of 51% and a lift of 2.
Let’s now look at another product, one that is not purchased as frequently as whole milk. For instance, let’s choose newspapers.
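Instead of re-running apriori with an appearance constraint, rules for a given consequent can also be filtered out of a rule set that has already been mined. The sketch below shows this with subset() from arules on a few hypothetical baskets; the data and the object names (toy, toy.rules, milk.rules) are assumptions made for the example, and the thresholds are chosen only so that the toy data yields some rules.

```r
library(arules)

# Toy baskets (hypothetical data, for illustration only)
baskets <- list(
  c("whole milk", "yogurt", "other vegetables"),
  c("whole milk", "yogurt"),
  c("other vegetables", "yogurt", "whole milk"),
  c("newspapers", "soda"),
  c("whole milk", "other vegetables"),
  c("soda", "yogurt")
)
toy <- as(baskets, "transactions")

# Mine all rules once, without any appearance constraint
toy.rules <- apriori(toy,
                     parameter = list(supp = 0.3, conf = 0.5, minlen = 2),
                     control = list(verbose = FALSE))

# Keep only the rules whose consequent is whole milk
milk.rules <- subset(toy.rules, subset = rhs %in% "whole milk")

# Rank the remaining rules by support and show the top ones
inspect(head(sort(milk.rules, by = "support"), 3))
```

This keeps a single mined rule set as the source for several consequents, at the cost of mining more rules up front.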
rules.newspapers<-apriori(data=trans, parameter=list(supp=0.001,conf = 0.15),
appearance=list(default="lhs", rhs ="newspapers"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.15 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [183 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules.newspapers.bysupp<-sort(rules.newspapers, by="support", decreasing=TRUE)
inspect(head(rules.newspapers.bysupp))
## lhs rhs support confidence
## [1] {other vegetables, soda} => {newspapers} 0.004982206 0.1521739
## [2] {soda, yogurt} => {newspapers} 0.004270463 0.1561338
## [3] {rolls/buns, tropical fruit} => {newspapers} 0.004168785 0.1694215
## [4] {brown bread, whole milk} => {newspapers} 0.004067107 0.1612903
## [5] {bottled water, rolls/buns} => {newspapers} 0.003863752 0.1596639
## [6] {beef, other vegetables} => {newspapers} 0.003253686 0.1649485
## coverage lift count
## [1] 0.03274021 1.906536 49
## [2] 0.02735130 1.956148 42
## [3] 0.02460600 2.122625 41
## [4] 0.02521607 2.020752 40
## [5] 0.02419929 2.000375 38
## [6] 0.01972547 2.066583 32
The Apriori rules indicate that when the consequent is newspapers, the most common antecedent is other vegetables together with soda, with a support of 0.5%, a confidence of 15% and a lift of 1.9.
Below I have presented some visualization plots.
rules_for_plot <- head(sort(sort(apr.rules, by ="confidence"),by="support"),15)
plot(rules_for_plot, method ="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{root vegetables}" "{other vegetables}" "{yogurt}"
## [4] "{tropical fruit}" "{whole milk}" "{rolls/buns}"
## Itemsets in Consequent (RHS)
## [1] "{rolls/buns}" "{whole milk}" "{yogurt}"
## [4] "{other vegetables}" "{root vegetables}"
This figure shows the matrix for 15 rules. The x axis shows the antecedents (LHS) and the y axis the consequents (RHS). Colour indicates the lift: the more reddish the cell, the higher the value. Two rules have a very high lift (top left corner): {other vegetables} => {root vegetables} and {root vegetables} => {other vegetables}. This is logical because those two products are usually purchased together (the purchase order does not matter).
plot(rules_for_plot, method="paracoord", control=list(reorder=TRUE))
The chart above shows the complexity of the rules that contain a specific product. The longer the arrow, the longer the baskets for the given product. However, for these 15 rules, every arrow spans a single interval.
Two charts are presented below. The first shows 15 rules for whole milk purchases and the second shows 15 rules for newspaper purchases. Note that the [1:15] index selects the first 15 rules of each rule set as stored, which are not necessarily the strongest ones; sorting by a quality measure first would rank them.
plot(rules.whole.milk[1:15], method="graph", control = list(cex=0.9))
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
plot(rules.newspapers[1:15], method="graph", control = list(cex=0.9))
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
The plots below show 15 rules mined with the Apriori algorithm across all products in the dataset; again, apr.rules[1:15] selects the first 15 rules as stored rather than the 15 strongest.
plot(apr.rules[1:15], method="grouped")
plot(apr.rules[1:15], method="graph", control=list(type="items"))
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
The crème de la crème are the interactive visualizations. Here you can play around a little with the products and the rules associated with them. Have fun!
plot(apr.rules, method="graph", engine="htmlwidget")
This paper aimed to discover the most common patterns in product purchase behaviour. On the one hand, the analysis makes it possible to place products sensibly in the shopping area based on the dependencies between purchases, which would make shoppers' lives easier and save some of the time spent in the shop. On the other hand, market basket analysis creates the possibility of nudging people to buy more of the products that are usually purchased together, even if they do not really want to buy them.
To summarize, association rules are a very powerful tool for detecting relationships between items. In this paper the subject was purchasing behaviour; however, Market Basket Analysis might also be used in other fields, for instance to detect sightseeing patterns in tourism (if tourists saw one object, which place will they visit next?), or even in medical research (if a patient struggles with some illness, which illness is most likely to occur next?).