Basket analysis is a data mining method that describes a data set with a set of approximate association rules, i.e. relationships and associations between specific variable values. It reveals purchasing patterns by producing rules about what buying a particular item implies for the rest of the basket. These rules are derived from how frequently given items occur together. The rules may then be used for cross-selling and product placement.
A market basket analysis requires a data set containing a list of store transactions, each listing the products bought together. The data set is available for download here.
library(arules)  # provides read.transactions(), apriori(), eclat() and related functions
basket <- read.transactions("Market_Basket_Optimisation.csv", rm.duplicates= FALSE, format="basket", sep=",", skip=0)
## Warning in asMethod(object): removing duplicated items in transactions
summary(basket)
## transactions as itemMatrix in sparse format with
## 7501 rows (elements/itemsets/transactions) and
## 119 columns (items) and a density of 0.03288973
##
## most frequent items:
## mineral water eggs spaghetti french fries chocolate
## 1788 1348 1306 1282 1229
## (Other)
## 22405
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1754 1358 1044 816 667 493 391 324 259 139 102 67 40 22 17 4
## 18 19 20
## 1 2 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.914 5.000 20.000
##
## includes extended item information - examples:
## labels
## 1 almonds
## 2 antioxydant juice
## 3 asparagus
sort(itemFrequency(basket, type="absolute"))
## water spray napkins cream
## 3 5 7
## bramble tea chutney
## 14 29 31
## mashed potato chocolate bread dessert wine
## 31 32 33
## ketchup oatmeal babies food
## 33 33 34
## sandwich asparagus cauliflower
## 34 36 36
## corn salad shampoo
## 36 37 37
## hand protein bar mint green tea burger sauce
## 39 42 44
## pickles chili mayonnaise
## 45 46 46
## soda sparkling water pet food
## 47 47 49
## gluten free bar spinach shallot
## 52 53 58
## strong cheese toothpaste clothes accessories
## 58 61 63
## bacon bug spray green beans
## 65 65 65
## antioxydant juice flax seed green grapes
## 67 68 68
## blueberries salt whole weat flour
## 69 69 70
## zucchini candy bars nonfat milk
## 71 73 78
## cider barbecue sauce magazines
## 79 81 82
## body spray yams extra dark chocolate
## 86 86 90
## melons eggplant gums
## 90 99 101
## fromage blanc tomato sauce black tea
## 102 106 107
## carrots light cream pasta
## 115 117 118
## white wine mint protein bar
## 124 131 139
## rice mushroom cream sauce parmesan cheese
## 141 143 149
## almonds meatballs strawberries
## 153 157 160
## fresh tuna french wine oil
## 167 169 173
## muffins cereals vegetables mix
## 181 193 193
## ham pepper energy drink
## 199 199 200
## energy bar light mayo yogurt cake
## 203 204 205
## red wine whole wheat pasta butter
## 211 221 226
## tomato juice cottage cheese hot dogs
## 228 239 243
## avocado brownies salmon
## 250 253 319
## fresh bread champagne honey
## 323 351 356
## herb & pepper soup cooking oil
## 371 379 383
## grated cheese whole wheat rice chicken
## 393 439 450
## turkey frozen smoothie olive oil
## 469 475 494
## tomatoes shrimp low fat yogurt
## 513 536 574
## escalope cookies cake
## 595 603 608
## burgers pancakes frozen vegetables
## 654 713 715
## ground beef milk green tea
## 737 972 991
## chocolate french fries spaghetti
## 1229 1282 1306
## eggs mineral water
## 1348 1788
The data set contains 7501 transactions (rows) and 119 different products (columns). The most frequently bought items are mineral water, eggs, spaghetti, french fries and chocolate.
itemFrequencyPlot(basket, topN=15, type="absolute", main="Absolute Frequency")
itemFrequencyPlot(basket, topN=15, type="relative", main="Relative Frequency")
The frequency plot shows that mineral water clearly dominates as the most frequently bought item. Besides the top 5, green tea and milk are also purchased frequently.
image(sample(basket, 100))
The graph shows a random sample of 100 transactions and the products they contain. Even in a sample, there are visible vertical lines, which show that some items are purchased far more frequently than others.
An efficient and popular basket analysis tool is the Apriori algorithm. It defines how the data is explored and how the usefulness of a rule is evaluated. Apriori not only shows relationships between products; by design it also allows insignificant relationships to be rejected. To this end, it introduces two important concepts: support, the share of transactions that contain a given set of items, and confidence, the share of transactions containing the left-hand side of a rule that also contain its right-hand side.
The algorithm lets the user set minimum values for these two indicators, so that item sets and rules that do not meet the quality requirements for a recommendation are rejected. The algorithm works iteratively, extending item sets step by step rather than processing all combinations at once, which limits the number of computations on the database.
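To make the two indicators concrete, here is a minimal base R sketch that computes support and confidence by hand. The baskets, the helper functions `support()` and `confidence()`, and the thresholds are all hypothetical and chosen only for illustration; they are not taken from the data set or from the arules package.

```r
# Toy baskets (hypothetical data, not the data set analysed in this report)
baskets <- list(
  c("mineral water", "eggs", "spaghetti"),
  c("mineral water", "spaghetti"),
  c("eggs", "chocolate"),
  c("mineral water", "eggs"),
  c("spaghetti", "chocolate")
)

# Support: share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of the rule lhs => rhs: support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

# Rule {spaghetti} => {mineral water}
supp_rule <- support(c("spaghetti", "mineral water"), baskets)  # 2 of 5 baskets
conf_rule <- confidence("spaghetti", "mineral water", baskets)  # 2 of the 3 spaghetti baskets

# A rule is kept only if it reaches both minimum thresholds (illustrative values)
min_supp <- 0.3
min_conf <- 0.5
keep_rule <- supp_rule >= min_supp && conf_rule >= min_conf
```

With these toy baskets the rule has a support of 0.4 and a confidence of about 0.67, so it passes both illustrative thresholds.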
rules.basket <- apriori(basket, parameter=list(supp=0.01, conf=0.05))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 75
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [403 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Rules by confidence:
rules.conf.basket <- sort(rules.basket, by="confidence", decreasing=TRUE)
inspect(head(rules.conf.basket))
## lhs rhs support confidence
## [1] {eggs,ground beef} => {mineral water} 0.01013198 0.5066667
## [2] {ground beef,milk} => {mineral water} 0.01106519 0.5030303
## [3] {chocolate,ground beef} => {mineral water} 0.01093188 0.4739884
## [4] {frozen vegetables,milk} => {mineral water} 0.01106519 0.4689266
## [5] {soup} => {mineral water} 0.02306359 0.4564644
## [6] {pancakes,spaghetti} => {mineral water} 0.01146514 0.4550265
## coverage lift count
## [1] 0.01999733 2.125563 76
## [2] 0.02199707 2.110308 83
## [3] 0.02306359 1.988472 82
## [4] 0.02359685 1.967236 83
## [5] 0.05052660 1.914955 173
## [6] 0.02519664 1.908923 86
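As a quick check, the measures reported for the first rule, {eggs, ground beef} => {mineral water}, can be reproduced by hand. The antecedent count of 150 is not printed in the output; it is inferred here from the coverage (0.01999733 × 7501), so treat it as a derived value.

```latex
\begin{align*}
\mathrm{supp} &= \frac{76}{7501} \approx 0.0101 \\
\mathrm{conf} &= \frac{\mathrm{supp}(\{\text{eggs, ground beef, mineral water}\})}{\mathrm{supp}(\{\text{eggs, ground beef}\})} = \frac{76}{150} \approx 0.5067 \\
\mathrm{lift} &= \frac{\mathrm{conf}}{\mathrm{supp}(\{\text{mineral water}\})} = \frac{0.5067}{1788 / 7501} \approx 2.13
\end{align*}
```

A lift above 1 means that mineral water appears in baskets with eggs and ground beef about twice as often as it does in baskets overall.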
Rules by lift:
rules.lift.basket <- sort(rules.basket, by="lift", decreasing=TRUE)
inspect(head(rules.lift.basket))
## lhs rhs support confidence
## [1] {ground beef} => {herb & pepper} 0.01599787 0.1628223
## [2] {herb & pepper} => {ground beef} 0.01599787 0.3234501
## [3] {mineral water,spaghetti} => {ground beef} 0.01706439 0.2857143
## [4] {mineral water,spaghetti} => {olive oil} 0.01026530 0.1718750
## [5] {frozen vegetables} => {tomatoes} 0.01613118 0.1692308
## [6] {tomatoes} => {frozen vegetables} 0.01613118 0.2358674
## coverage lift count
## [1] 0.09825357 3.291994 120
## [2] 0.04946007 3.291994 120
## [3] 0.05972537 2.907928 128
## [4] 0.05972537 2.609786 77
## [5] 0.09532062 2.474464 121
## [6] 0.06839088 2.474464 121
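Rules [1] and [2] share the same lift, as do rules [5] and [6]. This follows from the definition of lift, which is symmetric in the two sides of a rule. A short worked check using the support and coverage values printed above:

```latex
\begin{align*}
\mathrm{lift}(A \Rightarrow B) &= \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)\,\mathrm{supp}(B)} = \mathrm{lift}(B \Rightarrow A) \\
\mathrm{lift}(\{\text{ground beef}\} \Rightarrow \{\text{herb \& pepper}\}) &= \frac{0.015998}{0.098254 \times 0.049460} \approx 3.29
\end{align*}
```

Confidence, by contrast, depends on the direction of the rule, which is why the two rules differ in that column.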
Rules by count:
rules.count.basket <- sort(rules.basket, by="count", decreasing=TRUE)
inspect(head(rules.count.basket))
## lhs rhs support confidence coverage lift count
## [1] {} => {mineral water} 0.2383682 0.2383682 1 1 1788
## [2] {} => {eggs} 0.1797094 0.1797094 1 1 1348
## [3] {} => {spaghetti} 0.1741101 0.1741101 1 1 1306
## [4] {} => {french fries} 0.1709105 0.1709105 1 1 1282
## [5] {} => {chocolate} 0.1638448 0.1638448 1 1 1229
## [6] {} => {green tea} 0.1321157 0.1321157 1 1 991
Rules by support:
rules.supp.basket <- sort(rules.basket, by="support", decreasing=TRUE)
inspect(head(rules.supp.basket))
## lhs rhs support confidence coverage lift count
## [1] {} => {mineral water} 0.2383682 0.2383682 1 1 1788
## [2] {} => {eggs} 0.1797094 0.1797094 1 1 1348
## [3] {} => {spaghetti} 0.1741101 0.1741101 1 1 1306
## [4] {} => {french fries} 0.1709105 0.1709105 1 1 1282
## [5] {} => {chocolate} 0.1638448 0.1638448 1 1 1229
## [6] {} => {green tea} 0.1321157 0.1321157 1 1 991
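To understand what may drive a particular purchase, the rules need to be inspected further. By generating rules for a particular product, it may be possible to identify what tends to precede or follow its purchase according to the data, although association alone does not prove causality. I decided to focus on three products: red wine, white wine and champagne.
Generating the rules with each of these products as the consequent (right-hand side):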
rules.redwine.byconf <- sort(rules.redwine, by="confidence", decreasing=TRUE)
inspect(head(rules.redwine.byconf))
## lhs rhs support confidence
## [1] {chocolate bread} => {red wine} 0.001066524 0.2500000
## [2] {pet food} => {red wine} 0.001599787 0.2448980
## [3] {chocolate,tomato sauce} => {red wine} 0.001066524 0.2105263
## [4] {cooking oil,mineral water,spaghetti} => {red wine} 0.001066524 0.1403509
## [5] {green beans} => {red wine} 0.001199840 0.1384615
## [6] {mineral water,rice} => {red wine} 0.001066524 0.1379310
## coverage lift count
## [1] 0.004266098 8.887441 8
## [2] 0.006532462 8.706064 12
## [3] 0.005065991 7.484161 8
## [4] 0.007598987 4.989440 8
## [5] 0.008665511 4.922275 9
## [6] 0.007732302 4.903416 8
rules.whitewine.byconf <- sort(rules.whitewine, by="confidence", decreasing=TRUE)
inspect(head(rules.whitewine.byconf))
## lhs rhs support confidence coverage
## [1] {cake,spaghetti} => {white wine} 0.001066524 0.05882353 0.01813092
## [2] {pancakes,spaghetti} => {white wine} 0.001333156 0.05291005 0.02519664
## lift count
## [1] 3.558349 8
## [2] 3.200632 10
rules.champagne.byconf <- sort(rules.champagne, by="confidence", decreasing=TRUE)
inspect(head(rules.champagne.byconf))
## lhs rhs support confidence
## [1] {frozen smoothie,ground beef} => {champagne} 0.001466471 0.1929825
## [2] {ground beef,salmon} => {champagne} 0.001066524 0.1194030
## [3] {chocolate,fresh bread} => {champagne} 0.001066524 0.1126761
## [4] {cookies,green tea} => {champagne} 0.001333156 0.1111111
## [5] {chocolate,frozen smoothie} => {champagne} 0.001599787 0.1071429
## [6] {french fries,frozen vegetables} => {champagne} 0.001999733 0.1048951
## coverage lift count
## [1] 0.007598987 4.124107 11
## [2] 0.008932142 2.551686 8
## [3] 0.009465405 2.407929 8
## [4] 0.011998400 2.374486 10
## [5] 0.014931342 2.289683 12
## [6] 0.019064125 2.241647 15
From the analysis it can be deduced that, for red wine purchases, chocolate bread shows the highest confidence. The top results also include chocolate, tomato sauce, spaghetti and cooking oil, which is intuitive. For white wine purchases only two rules passed the given threshold, and both of them contain spaghetti. Champagne shows the most variety, but chocolate, frozen smoothie and ground beef each appear more than once.
library(arulesViz)  # provides the plot() methods for rules used below
plot(rules.redwine, method="graph")
plot(rules.whitewine, method="graph")
plot(rules.champagne, method="graph")
The graph for white wine clearly shows that there are only 2 rules for a white wine purchase. Because the number of rules for red wine and champagne is fairly high, these dependencies can also be inspected on the parallel coordinates plots below.
plot(rules.redwine, method="paracoord", control=list(reorder=TRUE))
plot(rules.whitewine, method="paracoord", control=list(reorder=TRUE))
plot(rules.champagne, method="paracoord", control=list(reorder=TRUE))
The next step is to investigate which products are likely to be bought when red wine, white wine or champagne is already in the basket.
rules.redwine <- apriori(data=basket, parameter=list(supp=0.001,conf = 0.05),
appearance=list(default="rhs",lhs="red wine"), control=list(verbose=F))
rules.whitewine <- apriori(data=basket, parameter=list(supp=0.001,conf = 0.05),
appearance=list(default="rhs",lhs="white wine"), control=list(verbose=F))
rules.champagne <- apriori(data=basket, parameter=list(supp=0.001,conf = 0.05),
appearance=list(default="rhs",lhs="champagne"), control=list(verbose=F))
rules.redwine.byconf <- sort(rules.redwine, by="confidence", decreasing=TRUE)
inspect(head(rules.redwine.byconf))
## lhs rhs support confidence coverage lift
## [1] {red wine} => {mineral water} 0.010931876 0.3886256 0.02812958 1.630358
## [2] {red wine} => {spaghetti} 0.010265298 0.3649289 0.02812958 2.095966
## [3] {red wine} => {eggs} 0.007065725 0.2511848 0.02812958 1.397728
## [4] {} => {mineral water} 0.238368218 0.2383682 1.00000000 1.000000
## [5] {red wine} => {french fries} 0.005332622 0.1895735 0.02812958 1.109197
## [6] {} => {eggs} 0.179709372 0.1797094 1.00000000 1.000000
## count
## [1] 82
## [2] 77
## [3] 53
## [4] 1788
## [5] 40
## [6] 1348
rules.whitewine.byconf <- sort(rules.whitewine, by="confidence", decreasing=TRUE)
inspect(head(rules.whitewine.byconf))
## lhs rhs support confidence coverage lift
## [1] {white wine} => {spaghetti} 0.004532729 0.2741935 0.01653113 1.574828
## [2] {white wine} => {mineral water} 0.004399413 0.2661290 0.01653113 1.116462
## [3] {} => {mineral water} 0.238368218 0.2383682 1.00000000 1.000000
## [4] {white wine} => {milk} 0.003466205 0.2096774 0.01653113 1.618097
## [5] {white wine} => {chocolate} 0.003466205 0.2096774 0.01653113 1.279732
## [6] {} => {eggs} 0.179709372 0.1797094 1.00000000 1.000000
## count
## [1] 34
## [2] 33
## [3] 1788
## [4] 26
## [5] 26
## [6] 1348
rules.champagne.byconf <- sort(rules.champagne, by="confidence", decreasing=TRUE)
inspect(head(rules.champagne.byconf))
## lhs rhs support confidence coverage lift
## [1] {champagne} => {chocolate} 0.011598454 0.2478632 0.04679376 1.512793
## [2] {} => {mineral water} 0.238368218 0.2383682 1.00000000 1.000000
## [3] {champagne} => {french fries} 0.009332089 0.1994302 0.04679376 1.166869
## [4] {} => {eggs} 0.179709372 0.1797094 1.00000000 1.000000
## [5] {champagne} => {green tea} 0.008265565 0.1766382 0.04679376 1.336996
## [6] {} => {spaghetti} 0.174110119 0.1741101 1.00000000 1.000000
## count
## [1] 87
## [2] 1788
## [3] 70
## [4] 1348
## [5] 62
## [6] 1306
In the case of red wine, mineral water is in first place; however, it is the most frequently bought item overall (the rule with an empty left-hand side already gives it a confidence of 0.238), so this relation is not very informative. The second item is spaghetti. For white wine, which as a consequent was associated only with spaghetti, pancakes and cake, spaghetti also shows the highest confidence. When it comes to champagne, the highest confidence is shown by chocolate.
plot(rules.redwine, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{red wine}" "{}"
## Itemsets in Consequent (RHS)
## [1] "{cookies}" "{frozen vegetables}" "{frozen smoothie}"
## [4] "{shrimp}" "{escalope}" "{chocolate}"
## [7] "{milk}" "{french fries}" "{low fat yogurt}"
## [10] "{tomatoes}" "{green tea}" "{grated cheese}"
## [13] "{olive oil}" "{pancakes}" "{whole wheat rice}"
## [16] "{eggs}" "{cake}" "{soup}"
## [19] "{chicken}" "{champagne}" "{burgers}"
## [22] "{cooking oil}" "{mineral water}" "{turkey}"
## [25] "{ground beef}" "{honey}" "{spaghetti}"
## [28] "{herb & pepper}" "{hot dogs}" "{salmon}"
## [31] "{avocado}" "{tomato juice}" "{french wine}"
## [34] "{ham}" "{rice}" "{pet food}"
plot(rules.whitewine, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{white wine}" "{}"
## Itemsets in Consequent (RHS)
## [1] "{french fries}" "{low fat yogurt}" "{eggs}"
## [4] "{frozen vegetables}" "{cookies}" "{soup}"
## [7] "{whole wheat rice}" "{tomatoes}" "{escalope}"
## [10] "{burgers}" "{turkey}" "{cake}"
## [13] "{mineral water}" "{frozen smoothie}" "{green tea}"
## [16] "{chocolate}" "{cooking oil}" "{pancakes}"
## [19] "{grated cheese}" "{ground beef}" "{spaghetti}"
## [22] "{milk}" "{shrimp}" "{champagne}"
## [25] "{olive oil}" "{herb & pepper}" "{chicken}"
## [28] "{brownies}" "{pepper}" "{fresh bread}"
plot(rules.champagne, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{champagne}" "{}"
## Itemsets in Consequent (RHS)
## [1] "{mineral water}" "{shrimp}" "{eggs}"
## [4] "{spaghetti}" "{tomatoes}" "{ground beef}"
## [7] "{milk}" "{low fat yogurt}" "{chicken}"
## [10] "{burgers}" "{cookies}" "{turkey}"
## [13] "{grated cheese}" "{soup}" "{cooking oil}"
## [16] "{olive oil}" "{cake}" "{frozen vegetables}"
## [19] "{escalope}" "{whole wheat rice}" "{pancakes}"
## [22] "{french fries}" "{green tea}" "{chocolate}"
## [25] "{salmon}" "{frozen smoothie}" "{fresh bread}"
The next step is to check the significance of particular rules. Significance was tested using Fisher's exact test, which determines whether the association between two variables is non-random.
is.significant(rules.redwine, basket)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
is.significant(rules.whitewine, basket)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE
is.significant(rules.champagne, basket)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
As can be seen, only a few rules were flagged as significant: four for red wine, and only one each for white wine and champagne.
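To illustrate what such a test looks at, the sketch below applies base R's `fisher.test()` to a single rule, {red wine} => {spaghetti}, using counts that appear in the outputs above (211 red wine baskets, 1306 spaghetti baskets, 77 with both). This is only an illustration of the idea: `is.significant()` may use a different significance level and a correction for multiple comparisons, so its flags need not match a single unadjusted test.

```r
# Counts taken from the outputs above
n_total     <- 7501   # all transactions
n_redwine   <- 211    # transactions with red wine
n_spaghetti <- 1306   # transactions with spaghetti
n_both      <- 77     # transactions with both items

# 2x2 contingency table: rows = red wine yes/no, columns = spaghetti yes/no
tab <- matrix(
  c(n_both,               n_redwine - n_both,
    n_spaghetti - n_both, n_total - n_redwine - n_spaghetti + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(red_wine = c("yes", "no"), spaghetti = c("yes", "no"))
)

# One-sided test: is spaghetti bought together with red wine more often than by chance?
test <- fisher.test(tab, alternative = "greater")
p_value <- test$p.value
```

Under independence roughly 37 baskets would be expected to contain both items (211 × 1306 / 7501), compared with the 77 observed, which is why the unadjusted p-value for this pair is very small.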
ECLAT stands for (inhale) Equivalence Class Clustering and bottom-up Lattice Traversal. Like Apriori, it is used for mining frequent item sets. In contrast to the Apriori algorithm, ECLAT works on a vertical data layout, storing for each item the list of transactions that contain it, which often makes it faster and more scalable. With Eclat only support is computed, because the result consists of item sets and their supports. As no rules are created at this stage, there is no need to calculate confidence.
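The vertical layout can be sketched in a few lines of base R. The baskets and variable names below are hypothetical and serve only to show the idea of transaction ID lists; the actual `eclat()` implementation is more elaborate.

```r
# Toy baskets (hypothetical data, not the data set analysed in this report)
baskets <- list(
  c("mineral water", "eggs", "spaghetti"),
  c("mineral water", "spaghetti"),
  c("eggs", "chocolate"),
  c("mineral water", "eggs"),
  c("spaghetti", "chocolate")
)

items <- sort(unique(unlist(baskets)))

# Vertical layout: for each item, the IDs of the transactions that contain it
tid_lists <- lapply(setNames(items, items), function(item) {
  which(vapply(baskets, function(b) item %in% b, logical(1)))
})

# Support of a pair = size of the intersection of the two ID lists / number of baskets
pair_tids    <- intersect(tid_lists[["mineral water"]], tid_lists[["spaghetti"]])
pair_support <- length(pair_tids) / length(baskets)
```

Here mineral water occurs in baskets 1, 2 and 4 and spaghetti in baskets 1, 2 and 5, so the pair occurs in baskets 1 and 2 and has a support of 0.4. Larger item sets are obtained the same way, by intersecting further ID lists.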
items.basket <- eclat(basket, parameter=list(supp=0.005, maxlen=10))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.005 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 37
##
## create itemset ...
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [101 item(s)] done [0.00s].
## creating sparse bit matrix ... [101 row(s), 7501 column(s)] done [0.00s].
## writing ... [725 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
inspect(head(items.basket))
## items support transIdenticalToItemsets
## [1] {mineral water,nonfat milk} 0.005065991 38
## [2] {pasta,shrimp} 0.005065991 38
## [3] {escalope,pasta} 0.005865885 44
## [4] {extra dark chocolate,mineral water} 0.005732569 43
## [5] {mineral water,mint} 0.005865885 44
## [6] {black tea,mineral water} 0.005332622 40
## count
## [1] 38
## [2] 38
## [3] 44
## [4] 43
## [5] 44
## [6] 40
The ECLAT algorithm returns the frequent item sets, i.e. the bundles of products whose support reaches the minimum, which was set to 0.005 for this call. It found 725 such sets; the output above shows the first six, not the most frequent ones.
freq.rules <- ruleInduction(items.basket, basket, confidence=0.1)
freq.rules
## set of 1059 rules
inspect(head(freq.rules))
## lhs rhs support confidence lift
## [1] {nonfat milk} => {mineral water} 0.005065991 0.4871795 2.043811
## [2] {pasta} => {shrimp} 0.005065991 0.3220339 4.506672
## [3] {pasta} => {escalope} 0.005865885 0.3728814 4.700812
## [4] {extra dark chocolate} => {mineral water} 0.005732569 0.4777778 2.004369
## [5] {mint} => {mineral water} 0.005865885 0.3358779 1.409072
## [6] {black tea} => {mineral water} 0.005332622 0.3738318 1.568295
## itemset
## [1] 1
## [2] 2
## [3] 3
## [4] 4
## [5] 5
## [6] 6
The output is not sorted, so these are simply the first six induced rules rather than those with the highest support. In the first rule, the pair of nonfat milk and mineral water appears in about 0.5% of baskets. The confidence, i.e. the probability of buying mineral water when nonfat milk is already in the basket, is about 49%. Mineral water is the most frequently bought item, hence it is present in many bundles.
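The confidence and lift of this first rule can be verified from counts that appear earlier in the report: 38 baskets contain both items, 78 contain nonfat milk, and 1788 of the 7501 baskets contain mineral water.

```latex
\begin{align*}
\mathrm{conf}(\{\text{nonfat milk}\} \Rightarrow \{\text{mineral water}\}) &= \frac{38}{78} \approx 0.4872 \\
\mathrm{lift} &= \frac{0.4872}{1788 / 7501} \approx 2.04
\end{align*}
```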
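The Jaccard index is a statistic used to compare sets. It measures the similarity between two sets and is defined as the size of their intersection divided by the size of their union. The dissimilarity() function used below reports the corresponding distance, i.e. one minus the index, so larger values mean that two items are bought together less often.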
trans.basket <- basket[,itemFrequency(basket)>0.1]
jac.index <- dissimilarity(trans.basket, which="items")
round(jac.index, 3)
## chocolate eggs french fries green tea milk mineral water
## eggs 0.893
## french fries 0.885 0.884
## green tea 0.914 0.911 0.896
## milk 0.877 0.889 0.914 0.928
## mineral water 0.849 0.861 0.910 0.908 0.850
## spaghetti 0.869 0.885 0.913 0.905 0.868 0.831
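According to the obtained Jaccard distances, the most dissimilar pair is green tea and milk (0.928), followed by green tea and chocolate (0.914), french fries and milk (0.914), french fries and spaghetti (0.913), green tea and eggs (0.911), and french fries and mineral water (0.910). The most similar pair is mineral water and spaghetti (0.831).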
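One entry of the matrix can be reproduced by hand as a sanity check. The sketch below uses mineral water and spaghetti; the item counts come from the frequency table, while the joint count of 448 is not printed directly and is inferred here from the coverage of {mineral water, spaghetti} (0.05972537 × 7501) in the lift table.

```r
# Counts taken or derived from the outputs above
n_water     <- 1788  # transactions with mineral water
n_spaghetti <- 1306  # transactions with spaghetti
n_both      <- 448   # transactions with both (derived from coverage, see text)

# Jaccard index: size of the intersection divided by size of the union
jaccard_index <- n_both / (n_water + n_spaghetti - n_both)

# The dissimilarity matrix reports the distance, i.e. 1 - index
jaccard_distance <- 1 - jaccard_index
round(jaccard_distance, 3)
```

The result agrees with the 0.831 shown in the matrix: even the most similar pair shares only about 17% of the baskets in which either item appears.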
plot(hclust(jac.index, method="ward.D2"), main="Dendrogram for items")
The dendrogram above helps visualise the concept of dissimilarity. The pairs indicated by the Jaccard distances are clearly visible on the graph.
Association rules are a powerful tool for discovering patterns and dependencies between items that may not be obvious at first. They help reveal actual consumer behaviour and adjust marketing strategies to achieve the best results. They may be helpful not only for maximizing sales but also for minimizing costs, e.g. by reducing the stock of items that do not pair well when they are out of season.
Another advantage of this kind of analysis would be increased customer satisfaction. Cross-selling is a popular technique, but it can also be combined with sales and special discounts.
The last point worth noting is the possibility of improving advertising and making it better targeted, which may also decrease spending and increase satisfaction.