This paper reviews algorithms and methods used for market basket analysis. The data were first explored, then two methods, Eclat and Apriori, were used to identify association rules, and the results were visualized.
First, the necessary libraries are loaded and the dataset is imported.
#data loading
library(gridExtra)
library(grid)
library(ggplot2)
library(lattice)
library(arules)
data("Groceries")
Let's have a look at the Groceries dataset, which ships with the arules package.
length(Groceries)
## [1] 9835
itemFrequency(Groceries, type="absolute")
## frankfurter sausage liver loaf
## 580 924 50
## ham meat finished products
## 256 254 64
## organic sausage chicken turkey
## 22 422 80
## pork beef hamburger meat
## 567 516 327
## fish citrus fruit tropical fruit
## 29 814 1032
## pip fruit grapes berries
## 744 220 327
## nuts/prunes root vegetables onions
## 33 1072 305
## herbs other vegetables packaged fruit/vegetables
## 160 1903 128
## whole milk butter curd
## 2513 545 524
## dessert butter milk yogurt
## 365 275 1372
## whipped/sour cream beverages UHT-milk
## 705 256 329
## condensed milk cream soft cheese
## 101 13 168
## sliced cheese hard cheese cream cheese
## 241 241 390
## processed cheese spread cheese curd cheese
## 163 110 50
## specialty cheese mayonnaise salad dressing
## 84 90 8
## tidbits frozen vegetables frozen fruits
## 23 473 12
## frozen meals frozen fish frozen chicken
## 279 115 6
## ice cream frozen dessert frozen potato products
## 246 106 83
## domestic eggs rolls/buns white bread
## 624 1809 414
## brown bread pastry roll products
## 638 875 101
## semi-finished bread zwieback potato products
## 174 68 28
## flour salt rice
## 171 106 75
## pasta vinegar oil
## 148 64 276
## margarine specialty fat sugar
## 576 36 333
## artif. sweetener honey mustard
## 32 15 118
## ketchup spices soups
## 42 51 67
## ready soups Instant food products sauces
## 18 79 54
## cereals organic products baking powder
## 56 16 174
## preservation products pudding powder canned vegetables
## 2 23 106
## canned fruit pickled vegetables specialty vegetables
## 32 176 17
## jam sweet spreads meat spreads
## 53 89 42
## canned fish dog food cat food
## 148 84 229
## pet care baby food coffee
## 93 1 571
## instant coffee tea cocoa drinks
## 73 38 22
## bottled water soda misc. beverages
## 1087 1715 279
## fruit/vegetable juice syrup bottled beer
## 711 32 792
## canned beer brandy whisky
## 764 41 8
## liquor rum liqueur
## 109 44 9
## liquor (appetizer) white wine red/blush wine
## 78 187 189
## prosecco sparkling wine salty snack
## 20 55 372
## popcorn nut snack snack products
## 71 31 30
## long life bakery product waffles cake bar
## 368 378 130
## chewing gum chocolate cooking chocolate
## 207 488 25
## specialty chocolate specialty bar chocolate marshmallow
## 299 269 89
## candy seasonal products detergent
## 294 140 189
## softener decalcifier dish cleaner
## 54 15 103
## abrasive cleaner cleaner toilet cleaner
## 35 50 7
## bathroom cleaner hair spray dental care
## 27 11 57
## male cosmetics make up remover skin care
## 45 8 35
## female sanitary products baby cosmetics soap
## 60 6 26
## rubbing alcohol hygiene articles napkins
## 10 324 515
## dishes cookware kitchen utensil
## 173 27 4
## cling film/bags kitchen towels house keeping products
## 112 59 82
## candles light bulbs sound storage medium
## 88 41 1
## newspapers photo/film pot plants
## 785 91 170
## flower soil/fertilizer flower (seeds) shopping bags
## 19 102 969
## bags
## 4
The table above shows the absolute frequency of each product. Raw counts are hard to compare, so the relative frequency of each product (its share of all transactions) is shown below.
round(itemFrequency(Groceries),3)
## frankfurter sausage liver loaf
## 0.059 0.094 0.005
## ham meat finished products
## 0.026 0.026 0.007
## organic sausage chicken turkey
## 0.002 0.043 0.008
## pork beef hamburger meat
## 0.058 0.052 0.033
## fish citrus fruit tropical fruit
## 0.003 0.083 0.105
## pip fruit grapes berries
## 0.076 0.022 0.033
## nuts/prunes root vegetables onions
## 0.003 0.109 0.031
## herbs other vegetables packaged fruit/vegetables
## 0.016 0.193 0.013
## whole milk butter curd
## 0.256 0.055 0.053
## dessert butter milk yogurt
## 0.037 0.028 0.140
## whipped/sour cream beverages UHT-milk
## 0.072 0.026 0.033
## condensed milk cream soft cheese
## 0.010 0.001 0.017
## sliced cheese hard cheese cream cheese
## 0.025 0.025 0.040
## processed cheese spread cheese curd cheese
## 0.017 0.011 0.005
## specialty cheese mayonnaise salad dressing
## 0.009 0.009 0.001
## tidbits frozen vegetables frozen fruits
## 0.002 0.048 0.001
## frozen meals frozen fish frozen chicken
## 0.028 0.012 0.001
## ice cream frozen dessert frozen potato products
## 0.025 0.011 0.008
## domestic eggs rolls/buns white bread
## 0.063 0.184 0.042
## brown bread pastry roll products
## 0.065 0.089 0.010
## semi-finished bread zwieback potato products
## 0.018 0.007 0.003
## flour salt rice
## 0.017 0.011 0.008
## pasta vinegar oil
## 0.015 0.007 0.028
## margarine specialty fat sugar
## 0.059 0.004 0.034
## artif. sweetener honey mustard
## 0.003 0.002 0.012
## ketchup spices soups
## 0.004 0.005 0.007
## ready soups Instant food products sauces
## 0.002 0.008 0.005
## cereals organic products baking powder
## 0.006 0.002 0.018
## preservation products pudding powder canned vegetables
## 0.000 0.002 0.011
## canned fruit pickled vegetables specialty vegetables
## 0.003 0.018 0.002
## jam sweet spreads meat spreads
## 0.005 0.009 0.004
## canned fish dog food cat food
## 0.015 0.009 0.023
## pet care baby food coffee
## 0.009 0.000 0.058
## instant coffee tea cocoa drinks
## 0.007 0.004 0.002
## bottled water soda misc. beverages
## 0.111 0.174 0.028
## fruit/vegetable juice syrup bottled beer
## 0.072 0.003 0.081
## canned beer brandy whisky
## 0.078 0.004 0.001
## liquor rum liqueur
## 0.011 0.004 0.001
## liquor (appetizer) white wine red/blush wine
## 0.008 0.019 0.019
## prosecco sparkling wine salty snack
## 0.002 0.006 0.038
## popcorn nut snack snack products
## 0.007 0.003 0.003
## long life bakery product waffles cake bar
## 0.037 0.038 0.013
## chewing gum chocolate cooking chocolate
## 0.021 0.050 0.003
## specialty chocolate specialty bar chocolate marshmallow
## 0.030 0.027 0.009
## candy seasonal products detergent
## 0.030 0.014 0.019
## softener decalcifier dish cleaner
## 0.005 0.002 0.010
## abrasive cleaner cleaner toilet cleaner
## 0.004 0.005 0.001
## bathroom cleaner hair spray dental care
## 0.003 0.001 0.006
## male cosmetics make up remover skin care
## 0.005 0.001 0.004
## female sanitary products baby cosmetics soap
## 0.006 0.001 0.003
## rubbing alcohol hygiene articles napkins
## 0.001 0.033 0.052
## dishes cookware kitchen utensil
## 0.018 0.003 0.000
## cling film/bags kitchen towels house keeping products
## 0.011 0.006 0.008
## candles light bulbs sound storage medium
## 0.009 0.004 0.000
## newspapers photo/film pot plants
## 0.080 0.009 0.017
## flower soil/fertilizer flower (seeds) shopping bags
## 0.002 0.010 0.099
## bags
## 0.000
It is still hard to pick out the most frequent products from these tables, so the 25 most common products are presented in the plots below.
itemFrequencyPlot(Groceries, topN=25, type="relative", main="Item Frequency")
itemFrequencyPlot(Groceries, topN=25, type="absolute", main="Item Frequency")
The first plot shows the relative frequency of each product, while the second shows absolute counts. Whole milk (2513 transactions), other vegetables (1903) and rolls/buns (1809) are the three most commonly bought products.
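The ranking behind the plots can also be read off numerically. The sketch below is a minimal base R illustration, assuming a hand-copied subset of the absolute counts printed above; the names `item_counts`, `top3` and `top3_share` are hypothetical. With arules loaded, the same sorting could be applied to the full vector returned by `itemFrequency()`.

```r
# Hand-copied subset of the absolute item counts printed above
# (illustrative only; the full vector has 169 items).
item_counts <- c(
  "tropical fruit"   = 1032,
  "root vegetables"  = 1072,
  "other vegetables" = 1903,
  "whole milk"       = 2513,
  "yogurt"           = 1372,
  "rolls/buns"       = 1809,
  "bottled water"    = 1087,
  "soda"             = 1715
)
n_transactions <- 9835  # value of length(Groceries)

# Sort from most to least frequent and keep the top 3
top3 <- head(sort(item_counts, decreasing = TRUE), 3)

# Relative frequency = share of transactions containing the item
top3_share <- round(top3 / n_transactions, 3)
print(top3_share)
```

The shares obtained this way (0.256, 0.193 and 0.184) agree with the relative frequencies listed earlier for the same three products.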
Association rules are usually evaluated with four basic measures: support, confidence, expected confidence and lift. For a rule A => B, support is the share of all transactions in the analysed dataset that contain both A and B; the higher the value, the more often the two products appear together. Confidence is the number of transactions containing both A and B divided by the number of transactions containing A, i.e. the share of baskets with A that also contain B. Expected confidence serves as a benchmark: it is the share of all transactions that contain B, which is the confidence the rule would have if A and B were independent. Finally, lift is the ratio of confidence to expected confidence. In other words, it shows whether the products occur together more often (lift above 1) or less often (lift below 1) than they would if they were independent.
First, the Eclat algorithm was used to find the most frequent sets of products.
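To make the four measures concrete, the following base R sketch computes them by hand for one rule, root vegetables => other vegetables. It assumes the counts reported in this analysis (1072 and 1903 transactions for the single items, 466 for the pair, 9835 transactions in total); the variable names are illustrative and not part of arules.

```r
# Counts taken from the Groceries output in this analysis
n_total <- 9835  # all transactions
n_lhs   <- 1072  # transactions with root vegetables (A)
n_rhs   <- 1903  # transactions with other vegetables (B)
n_both  <- 466   # transactions with both A and B

support             <- n_both / n_total   # share of baskets with A and B
confidence          <- n_both / n_lhs     # share of A-baskets that also hold B
expected_confidence <- n_rhs / n_total    # share of all baskets that hold B
lift                <- confidence / expected_confidence

measures <- round(c(support = support,
                    confidence = confidence,
                    expected_confidence = expected_confidence,
                    lift = lift), 4)
print(measures)
```

The hand-computed support (0.0474), confidence (0.4347) and lift (2.2466) match the values that arules reports for this rule in the rule listings of this analysis.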
#eclat algorithm
Groceries_eclat<-eclat(Groceries, parameter=list(supp=0.03, minlen=2, maxlen=5))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.03 2 5 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 295
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [44 item(s)] done [0.00s].
## creating sparse bit matrix ... [44 row(s), 9835 column(s)] done [0.00s].
## writing ... [19 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(Groceries_eclat)
## items support count
## [1] {whole milk, whipped/sour cream} 0.03223183 317
## [2] {pip fruit, whole milk} 0.03009659 296
## [3] {whole milk, pastry} 0.03324860 327
## [4] {citrus fruit, whole milk} 0.03050330 300
## [5] {sausage, rolls/buns} 0.03060498 301
## [6] {whole milk, bottled water} 0.03436706 338
## [7] {tropical fruit, whole milk} 0.04229792 416
## [8] {tropical fruit, other vegetables} 0.03589222 353
## [9] {root vegetables, whole milk} 0.04890696 481
## [10] {root vegetables, other vegetables} 0.04738180 466
## [11] {whole milk, soda} 0.04006101 394
## [12] {other vegetables, soda} 0.03274021 322
## [13] {rolls/buns, soda} 0.03833249 377
## [14] {whole milk, yogurt} 0.05602440 551
## [15] {other vegetables, yogurt} 0.04341637 427
## [16] {yogurt, rolls/buns} 0.03436706 338
## [17] {whole milk, rolls/buns} 0.05663447 557
## [18] {other vegetables, rolls/buns} 0.04260295 419
## [19] {other vegetables, whole milk} 0.07483477 736
The parameters were set to 0.03 for minimum support, 2 for minimum length and 5 for maximum length. This means that only itemsets of 2 to 5 products with support of at least 0.03 are returned. Out of all itemsets in the dataset, 19 meet these requirements, and all of them are pairs. Additionally, one can derive rules from the itemsets found. Below are the 5 rules with confidence of at least 0.4.
#rules
rules_eclat<-ruleInduction(Groceries_eclat, Groceries, confidence=0.4)
rules_eclat
## set of 5 rules
inspect(rules_eclat)
## lhs rhs support confidence lift
## [1] {whipped/sour cream} => {whole milk} 0.03223183 0.4496454 1.759754
## [2] {tropical fruit} => {whole milk} 0.04229792 0.4031008 1.577595
## [3] {root vegetables} => {whole milk} 0.04890696 0.4486940 1.756031
## [4] {root vegetables} => {other vegetables} 0.04738180 0.4347015 2.246605
## [5] {yogurt} => {whole milk} 0.05602440 0.4016035 1.571735
## itemset
## [1] 1
## [2] 7
## [3] 9
## [4] 10
## [5] 14
The highest support is observed for the pair yogurt and whole milk: both products were present together in about 5.6% of all transactions. All five rules have lift above 1, and the rule root vegetables => other vegetables has the highest lift, 2.25. This means that root vegetables and other vegetables are bought together 2.25 times more often than would be expected if the two products were independent.
The second algorithm used to determine rules was Apriori. To make the rules easier to assess, they were sorted by lift and by support.
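Only 5 rules come out of the 19 itemsets because confidence depends on the direction of the rule. The sketch below illustrates this in base R for the itemset {whole milk, yogurt}, assuming the counts printed earlier (551 for the pair, 1372 for yogurt, 2513 for whole milk). It is a hand calculation with illustrative variable names, not a reproduction of what `ruleInduction()` does internally.

```r
# Counts taken from the output above
n_both       <- 551   # transactions with {whole milk, yogurt}
n_yogurt     <- 1372  # transactions with yogurt
n_whole_milk <- 2513  # transactions with whole milk
min_conf     <- 0.4   # confidence threshold used in this analysis

# The same itemset gives two candidate rules with different confidence,
# because the denominator is the count of the left-hand side item.
conf <- c(
  "yogurt => whole milk" = n_both / n_yogurt,
  "whole milk => yogurt" = n_both / n_whole_milk
)

# Keep only the directions that reach the threshold
kept <- conf[conf >= min_conf]
print(round(conf, 4))
print(names(kept))
```

Only yogurt => whole milk passes (confidence about 0.40); the reverse direction reaches only about 0.22, because whole milk appears in many baskets that contain no yogurt.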
#apriori
Groceries_apriori<-apriori(Groceries, parameter=list(supp=0.03, conf=0.4))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.03 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 295
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [44 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_apriori_lift<-sort(Groceries_apriori, by="lift", decreasing=TRUE)
inspect(rules_apriori_lift)
## lhs rhs support confidence coverage
## [1] {root vegetables} => {other vegetables} 0.04738180 0.4347015 0.10899847
## [2] {whipped/sour cream} => {whole milk} 0.03223183 0.4496454 0.07168277
## [3] {root vegetables} => {whole milk} 0.04890696 0.4486940 0.10899847
## [4] {tropical fruit} => {whole milk} 0.04229792 0.4031008 0.10493137
## [5] {yogurt} => {whole milk} 0.05602440 0.4016035 0.13950178
## lift count
## [1] 2.246605 466
## [2] 1.759754 317
## [3] 1.756031 481
## [4] 1.577595 416
## [5] 1.571735 551
rules_apriori_support<-sort(Groceries_apriori, by="support", decreasing=TRUE)
inspect(rules_apriori_support)
## lhs rhs support confidence coverage
## [1] {yogurt} => {whole milk} 0.05602440 0.4016035 0.13950178
## [2] {root vegetables} => {whole milk} 0.04890696 0.4486940 0.10899847
## [3] {root vegetables} => {other vegetables} 0.04738180 0.4347015 0.10899847
## [4] {tropical fruit} => {whole milk} 0.04229792 0.4031008 0.10493137
## [5] {whipped/sour cream} => {whole milk} 0.03223183 0.4496454 0.07168277
## lift count
## [1] 1.571735 551
## [2] 1.756031 481
## [3] 2.246605 466
## [4] 1.577595 416
## [5] 1.759754 317
Again, root vegetables with other vegetables have the highest lift, while yogurt and whole milk have the highest support. Both algorithms return the same five rules, and the results are in line with intuition: yogurt and whipped/sour cream are often bought with whole milk, and root vegetables are often bought with other vegetables. It is worth noting that 4 out of 5 rules contain whole milk, which was identified earlier as the most frequent product in the whole dataset.
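The two orderings and the whole milk count can be checked without arules. The sketch below assumes the five rules are transcribed by hand from the `inspect()` output above into a plain data frame (the object names are hypothetical) and uses base R `order()` in place of the arules `sort()` method.

```r
# The five rules, transcribed from the inspect() output above
rules <- data.frame(
  lhs     = c("root vegetables", "whipped/sour cream", "root vegetables",
              "tropical fruit", "yogurt"),
  rhs     = c("other vegetables", "whole milk", "whole milk",
              "whole milk", "whole milk"),
  support = c(0.04738180, 0.03223183, 0.04890696, 0.04229792, 0.05602440),
  lift    = c(2.246605, 1.759754, 1.756031, 1.577595, 1.571735),
  stringsAsFactors = FALSE
)

# Reorder the rows by each measure, highest value first
by_lift    <- rules[order(rules$lift, decreasing = TRUE), ]
by_support <- rules[order(rules$support, decreasing = TRUE), ]

# Number of rules that involve whole milk on either side
milk_rules <- sum(rules$lhs == "whole milk" | rules$rhs == "whole milk")

print(by_lift[1, c("lhs", "rhs", "lift")])
print(by_support[1, c("lhs", "rhs", "support")])
print(milk_rules)
```

Sorting by lift puts root vegetables => other vegetables first, sorting by support puts yogurt => whole milk first, and whole milk appears in 4 of the 5 rules, always on the right-hand side.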
library(arulesViz)
plot(rules_eclat, method="graph", control=list(type="items"))
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
plot(rules_eclat, method="paracoord", control=list(reorder=TRUE))
plot(rules_eclat, method="graph", engine="htmlwidget")
As only 5 rules were identified, the plots are sparse and cannot show their full potential. The most informative plot is the last one, which is also interactive: one can select a product or a rule and inspect its connections to other items in isolation.
The Groceries dataset from the arules package was analysed. The chosen thresholds (support 0.03 and confidence 0.4) yielded 5 rules. Whole milk appeared in the rules most often. However, this is most likely because whole milk is the most frequent product overall: it appears in over 25% of all transactions.