Market Basket Analysis

Aleksandra Tomczak

1st March 2021

1. Introduction

Market basket analysis is a data mining method that describes a data set through approximate association rules, i.e. relationships between specific values of variables. It uncovers purchasing patterns by generating rules about what buying a particular item implies for the rest of the basket. These rules are derived from how frequently given items occur together. The rules can then be used for cross-selling and product placement.

2. Database

The analysis uses a data set containing a list of store transactions, each recording the products bought together. The data set is available for download here.

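library(arules)     # transactions, apriori(), eclat(), ruleInduction(), dissimilarity()
library(arulesViz)  # plot() methods for association rules
basket <- read.transactions("Market_Basket_Optimisation.csv", rm.duplicates=FALSE, format="basket", sep=",", skip=0)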
## Warning in asMethod(object): removing duplicated items in transactions
summary(basket)
## transactions as itemMatrix in sparse format with
##  7501 rows (elements/itemsets/transactions) and
##  119 columns (items) and a density of 0.03288973 
## 
## most frequent items:
## mineral water          eggs     spaghetti  french fries     chocolate 
##          1788          1348          1306          1282          1229 
##       (Other) 
##         22405 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1754 1358 1044  816  667  493  391  324  259  139  102   67   40   22   17    4 
##   18   19   20 
##    1    2    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.914   5.000  20.000 
## 
## includes extended item information - examples:
##              labels
## 1           almonds
## 2 antioxydant juice
## 3         asparagus
sort(itemFrequency(basket, type="absolute"))
##          water spray              napkins                cream 
##                    3                    5                    7 
##              bramble                  tea              chutney 
##                   14                   29                   31 
##        mashed potato      chocolate bread         dessert wine 
##                   31                   32                   33 
##              ketchup              oatmeal          babies food 
##                   33                   33                   34 
##             sandwich            asparagus          cauliflower 
##                   34                   36                   36 
##                 corn                salad              shampoo 
##                   36                   37                   37 
##     hand protein bar       mint green tea         burger sauce 
##                   39                   42                   44 
##              pickles                chili           mayonnaise 
##                   45                   46                   46 
##                 soda      sparkling water             pet food 
##                   47                   47                   49 
##      gluten free bar              spinach              shallot 
##                   52                   53                   58 
##        strong cheese           toothpaste  clothes accessories 
##                   58                   61                   63 
##                bacon            bug spray          green beans 
##                   65                   65                   65 
##    antioxydant juice            flax seed         green grapes 
##                   67                   68                   68 
##          blueberries                 salt     whole weat flour 
##                   69                   69                   70 
##             zucchini           candy bars          nonfat milk 
##                   71                   73                   78 
##                cider       barbecue sauce            magazines 
##                   79                   81                   82 
##           body spray                 yams extra dark chocolate 
##                   86                   86                   90 
##               melons             eggplant                 gums 
##                   90                   99                  101 
##        fromage blanc         tomato sauce            black tea 
##                  102                  106                  107 
##              carrots          light cream                pasta 
##                  115                  117                  118 
##           white wine                 mint          protein bar 
##                  124                  131                  139 
##                 rice mushroom cream sauce      parmesan cheese 
##                  141                  143                  149 
##              almonds            meatballs         strawberries 
##                  153                  157                  160 
##           fresh tuna          french wine                  oil 
##                  167                  169                  173 
##              muffins              cereals       vegetables mix 
##                  181                  193                  193 
##                  ham               pepper         energy drink 
##                  199                  199                  200 
##           energy bar           light mayo          yogurt cake 
##                  203                  204                  205 
##             red wine    whole wheat pasta               butter 
##                  211                  221                  226 
##         tomato juice       cottage cheese             hot dogs 
##                  228                  239                  243 
##              avocado             brownies               salmon 
##                  250                  253                  319 
##          fresh bread            champagne                honey 
##                  323                  351                  356 
##        herb & pepper                 soup          cooking oil 
##                  371                  379                  383 
##        grated cheese     whole wheat rice              chicken 
##                  393                  439                  450 
##               turkey      frozen smoothie            olive oil 
##                  469                  475                  494 
##             tomatoes               shrimp       low fat yogurt 
##                  513                  536                  574 
##             escalope              cookies                 cake 
##                  595                  603                  608 
##              burgers             pancakes    frozen vegetables 
##                  654                  713                  715 
##          ground beef                 milk            green tea 
##                  737                  972                  991 
##            chocolate         french fries            spaghetti 
##                 1229                 1282                 1306 
##                 eggs        mineral water 
##                 1348                 1788

The data set contains 7501 transactions (rows) and 119 different products (columns). The most frequently bought items are mineral water, eggs, spaghetti, french fries and chocolate.

Plotting the frequency of the top 15 products:

Absolute
itemFrequencyPlot(basket, topN=15, type="absolute", main="Absolute Frequency")

Relative
itemFrequencyPlot(basket, topN=15, type="relative", main="Relative Frequency")

The frequency plots show that mineral water is clearly the most frequently bought item. Beyond the top five, green tea and milk are also purchased frequently.

image(sample(basket, 100))

The plot shows a random sample of 100 transactions, with one row per transaction and one column per item. Even in this small sample, vertical lines are visible, which shows that some items are purchased far more frequently than others.

3. Association Rules

3.1. Apriori Algorithm

An efficient and popular tool for basket analysis is the Apriori algorithm. It defines how the data are explored and how the usefulness of a rule is evaluated. Apriori not only shows relationships between products; by design it also discards insignificant ones. To this end, it relies on two important concepts:

  • support - the share of transactions in which the items of a rule occur together
  • confidence - the share of transactions containing the left-hand side of a rule that also contain its right-hand side

The algorithm lets the analyst set minimum values for both measures, so item sets and rules that fall below them are rejected. It works iteratively: frequent item sets are built up one item at a time, and a larger set is only counted if all of its subsets already met the minimum support. As a result, the algorithm limits the number of computations on the database.
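
The two measures can be illustrated with a minimal base R sketch. The toy transactions, the helper functions `support()` and `confidence()`, and the threshold values are all hypothetical and serve only to show the definitions; they are not part of the arules package or of the analysed data set.

```r
# Toy transactions (hypothetical data, not taken from the analysed data set)
toy <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "eggs"),
  c("bread", "butter"),
  c("bread", "eggs", "milk")
)

# Share of transactions that contain every item in `items`
support <- function(items, transactions) {
  mean(vapply(transactions, function(tr) all(items %in% tr), logical(1)))
}

# Share of transactions containing `lhs` that also contain `rhs`
confidence <- function(lhs, rhs, transactions) {
  support(c(lhs, rhs), transactions) / support(lhs, transactions)
}

supp_bm <- support(c("bread", "milk"), toy)   # 3 of 5 transactions
conf_bm <- confidence("bread", "milk", toy)   # 3 of the 4 bread transactions

# A rule is kept only if it clears both minimum thresholds (assumed values)
min_supp <- 0.4
min_conf <- 0.7
keep_rule <- supp_bm >= min_supp && conf_bm >= min_conf
```

In this sketch the rule {bread} => {milk} would be kept, whereas raising either threshold above the computed value would reject it.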

rules.basket <- apriori(basket, parameter=list(supp=0.01, conf=0.05))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.05    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 75 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [75 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [403 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Rules by confidence:

rules.conf.basket <- sort(rules.basket, by="confidence", decreasing=TRUE)
inspect(head(rules.conf.basket))
##     lhs                         rhs             support    confidence
## [1] {eggs,ground beef}       => {mineral water} 0.01013198 0.5066667 
## [2] {ground beef,milk}       => {mineral water} 0.01106519 0.5030303 
## [3] {chocolate,ground beef}  => {mineral water} 0.01093188 0.4739884 
## [4] {frozen vegetables,milk} => {mineral water} 0.01106519 0.4689266 
## [5] {soup}                   => {mineral water} 0.02306359 0.4564644 
## [6] {pancakes,spaghetti}     => {mineral water} 0.01146514 0.4550265 
##     coverage   lift     count
## [1] 0.01999733 2.125563  76  
## [2] 0.02199707 2.110308  83  
## [3] 0.02306359 1.988472  82  
## [4] 0.02359685 1.967236  83  
## [5] 0.05052660 1.914955 173  
## [6] 0.02519664 1.908923  86
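
The columns of this output are linked by simple arithmetic, which the following sketch checks for rule [1]. It assumes the usual definitions (coverage is the support of the left-hand side, confidence is support divided by coverage, and lift is confidence divided by the support of the right-hand side). The count of 150 for the left-hand side is not printed above; it is derived here from the reported coverage (0.01999733 × 7501).

```r
n          <- 7501   # transactions in the data set
count_rule <- 76     # baskets with eggs, ground beef and mineral water
count_lhs  <- 150    # baskets with eggs and ground beef (derived from coverage)
count_rhs  <- 1788   # baskets with mineral water

support    <- count_rule / n
coverage   <- count_lhs / n
confidence <- support / coverage
lift       <- confidence / (count_rhs / n)

# Compare with the values printed for rule [1]
round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 4)
```

A lift above 1 means the right-hand side appears more often in baskets containing the left-hand side than it does in baskets overall.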

Rules by lift:

rules.lift.basket <- sort(rules.basket, by="lift", decreasing=TRUE)
inspect(head(rules.lift.basket))  
##     lhs                          rhs                 support    confidence
## [1] {ground beef}             => {herb & pepper}     0.01599787 0.1628223 
## [2] {herb & pepper}           => {ground beef}       0.01599787 0.3234501 
## [3] {mineral water,spaghetti} => {ground beef}       0.01706439 0.2857143 
## [4] {mineral water,spaghetti} => {olive oil}         0.01026530 0.1718750 
## [5] {frozen vegetables}       => {tomatoes}          0.01613118 0.1692308 
## [6] {tomatoes}                => {frozen vegetables} 0.01613118 0.2358674 
##     coverage   lift     count
## [1] 0.09825357 3.291994 120  
## [2] 0.04946007 3.291994 120  
## [3] 0.05972537 2.907928 128  
## [4] 0.05972537 2.609786  77  
## [5] 0.09532062 2.474464 121  
## [6] 0.06839088 2.474464 121

Rules by count:

rules.count.basket <- sort(rules.basket, by="count", decreasing=TRUE)
inspect(head(rules.count.basket))  
##     lhs    rhs             support   confidence coverage lift count
## [1] {}  => {mineral water} 0.2383682 0.2383682  1        1    1788 
## [2] {}  => {eggs}          0.1797094 0.1797094  1        1    1348 
## [3] {}  => {spaghetti}     0.1741101 0.1741101  1        1    1306 
## [4] {}  => {french fries}  0.1709105 0.1709105  1        1    1282 
## [5] {}  => {chocolate}     0.1638448 0.1638448  1        1    1229 
## [6] {}  => {green tea}     0.1321157 0.1321157  1        1     991

Rules by support:

rules.supp.basket <- sort(rules.basket, by="support", decreasing=TRUE)
inspect(head(rules.supp.basket))    
##     lhs    rhs             support   confidence coverage lift count
## [1] {}  => {mineral water} 0.2383682 0.2383682  1        1    1788 
## [2] {}  => {eggs}          0.1797094 0.1797094  1        1    1348 
## [3] {}  => {spaghetti}     0.1741101 0.1741101  1        1    1306 
## [4] {}  => {french fries}  0.1709105 0.1709105  1        1    1282 
## [5] {}  => {chocolate}     0.1638448 0.1638448  1        1    1229 
## [6] {}  => {green tea}     0.1321157 0.1321157  1        1     991

Sorting by count and by support gives the same ordering, because count is simply support multiplied by the number of transactions (7501). The top entries have an empty left-hand side, so they only restate how often each item is bought.

The rules need further inspection before anything resembling cause and effect can be suggested. By generating rules for a particular product it may be possible to identify, according to the data, what leads to its purchase and what follows from it. I decided to focus on three products: red wine, white wine and champagne.

Generating the rules:

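# Assumed calls (not shown in the original): rules with each product on the
# right-hand side, mirroring the lhs-constrained calls used later on
rules.redwine <- apriori(data=basket, parameter=list(supp=0.001, conf=0.05),
                         appearance=list(default="lhs", rhs="red wine"),
                         control=list(verbose=FALSE))
rules.whitewine <- apriori(data=basket, parameter=list(supp=0.001, conf=0.05),
                           appearance=list(default="lhs", rhs="white wine"),
                           control=list(verbose=FALSE))
rules.champagne <- apriori(data=basket, parameter=list(supp=0.001, conf=0.05),
                           appearance=list(default="lhs", rhs="champagne"),
                           control=list(verbose=FALSE))
rules.redwine.byconf <- sort(rules.redwine, by="confidence", decreasing=TRUE)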
inspect(head(rules.redwine.byconf))
##     lhs                                      rhs        support     confidence
## [1] {chocolate bread}                     => {red wine} 0.001066524 0.2500000 
## [2] {pet food}                            => {red wine} 0.001599787 0.2448980 
## [3] {chocolate,tomato sauce}              => {red wine} 0.001066524 0.2105263 
## [4] {cooking oil,mineral water,spaghetti} => {red wine} 0.001066524 0.1403509 
## [5] {green beans}                         => {red wine} 0.001199840 0.1384615 
## [6] {mineral water,rice}                  => {red wine} 0.001066524 0.1379310 
##     coverage    lift     count
## [1] 0.004266098 8.887441  8   
## [2] 0.006532462 8.706064 12   
## [3] 0.005065991 7.484161  8   
## [4] 0.007598987 4.989440  8   
## [5] 0.008665511 4.922275  9   
## [6] 0.007732302 4.903416  8
rules.whitewine.byconf <- sort(rules.whitewine, by="confidence", decreasing=TRUE)
inspect(head(rules.whitewine.byconf))
##     lhs                     rhs          support     confidence coverage  
## [1] {cake,spaghetti}     => {white wine} 0.001066524 0.05882353 0.01813092
## [2] {pancakes,spaghetti} => {white wine} 0.001333156 0.05291005 0.02519664
##     lift     count
## [1] 3.558349  8   
## [2] 3.200632 10
rules.champagne.byconf <- sort(rules.champagne, by="confidence", decreasing=TRUE)
inspect(head(rules.champagne.byconf))
##     lhs                                 rhs         support     confidence
## [1] {frozen smoothie,ground beef}    => {champagne} 0.001466471 0.1929825 
## [2] {ground beef,salmon}             => {champagne} 0.001066524 0.1194030 
## [3] {chocolate,fresh bread}          => {champagne} 0.001066524 0.1126761 
## [4] {cookies,green tea}              => {champagne} 0.001333156 0.1111111 
## [5] {chocolate,frozen smoothie}      => {champagne} 0.001599787 0.1071429 
## [6] {french fries,frozen vegetables} => {champagne} 0.001999733 0.1048951 
##     coverage    lift     count
## [1] 0.007598987 4.124107 11   
## [2] 0.008932142 2.551686  8   
## [3] 0.009465405 2.407929  8   
## [4] 0.011998400 2.374486 10   
## [5] 0.014931342 2.289683 12   
## [6] 0.019064125 2.241647 15

From the analysis it can be deduced that, for red wine purchases, chocolate bread shows the highest confidence. The top rules also include chocolate, tomato sauce, spaghetti and cooking oil, which is plausible. For white wine only two rules passed the given thresholds, and both of them contain spaghetti. Champagne shows the most variety, but chocolate, frozen smoothie and ground beef each appear more than once.

Graphic Representation:

Red Wine
plot(rules.redwine, method="graph")

White Wine
plot(rules.whitewine, method="graph")

Champagne
plot(rules.champagne, method="graph")

The graph for white wine clearly shows that there are only two rules for white wine purchases. Because red wine and champagne have considerably more rules, these dependencies can also be inspected on the parallel coordinates plots below.

Parallel Coordinates:

Red Wine
plot(rules.redwine, method="paracoord", control=list(reorder=TRUE))

White Wine
plot(rules.whitewine, method="paracoord", control=list(reorder=TRUE))

Champagne
plot(rules.champagne, method="paracoord", control=list(reorder=TRUE))

The next step is to investigate which products are likely to be bought when red wine, white wine or champagne is already in the basket.

rules.redwine <- apriori(data=basket, parameter=list(supp=0.001, conf=0.05),
                         appearance=list(default="rhs", lhs="red wine"),
                         control=list(verbose=FALSE))

rules.whitewine <- apriori(data=basket, parameter=list(supp=0.001, conf=0.05),
                           appearance=list(default="rhs", lhs="white wine"),
                           control=list(verbose=FALSE))

rules.champagne <- apriori(data=basket, parameter=list(supp=0.001, conf=0.05),
                           appearance=list(default="rhs", lhs="champagne"),
                           control=list(verbose=FALSE))
rules.redwine.byconf <- sort(rules.redwine, by="confidence", decreasing=TRUE)
inspect(head(rules.redwine.byconf))
##     lhs           rhs             support     confidence coverage   lift    
## [1] {red wine} => {mineral water} 0.010931876 0.3886256  0.02812958 1.630358
## [2] {red wine} => {spaghetti}     0.010265298 0.3649289  0.02812958 2.095966
## [3] {red wine} => {eggs}          0.007065725 0.2511848  0.02812958 1.397728
## [4] {}         => {mineral water} 0.238368218 0.2383682  1.00000000 1.000000
## [5] {red wine} => {french fries}  0.005332622 0.1895735  0.02812958 1.109197
## [6] {}         => {eggs}          0.179709372 0.1797094  1.00000000 1.000000
##     count
## [1]   82 
## [2]   77 
## [3]   53 
## [4] 1788 
## [5]   40 
## [6] 1348
rules.whitewine.byconf <- sort(rules.whitewine, by="confidence", decreasing=TRUE)
inspect(head(rules.whitewine.byconf))
##     lhs             rhs             support     confidence coverage   lift    
## [1] {white wine} => {spaghetti}     0.004532729 0.2741935  0.01653113 1.574828
## [2] {white wine} => {mineral water} 0.004399413 0.2661290  0.01653113 1.116462
## [3] {}           => {mineral water} 0.238368218 0.2383682  1.00000000 1.000000
## [4] {white wine} => {milk}          0.003466205 0.2096774  0.01653113 1.618097
## [5] {white wine} => {chocolate}     0.003466205 0.2096774  0.01653113 1.279732
## [6] {}           => {eggs}          0.179709372 0.1797094  1.00000000 1.000000
##     count
## [1]   34 
## [2]   33 
## [3] 1788 
## [4]   26 
## [5]   26 
## [6] 1348
rules.champagne.byconf <- sort(rules.champagne, by="confidence", decreasing=TRUE)
inspect(head(rules.champagne.byconf))
##     lhs            rhs             support     confidence coverage   lift    
## [1] {champagne} => {chocolate}     0.011598454 0.2478632  0.04679376 1.512793
## [2] {}          => {mineral water} 0.238368218 0.2383682  1.00000000 1.000000
## [3] {champagne} => {french fries}  0.009332089 0.1994302  0.04679376 1.166869
## [4] {}          => {eggs}          0.179709372 0.1797094  1.00000000 1.000000
## [5] {champagne} => {green tea}     0.008265565 0.1766382  0.04679376 1.336996
## [6] {}          => {spaghetti}     0.174110119 0.1741101  1.00000000 1.000000
##     count
## [1]   87 
## [2] 1788 
## [3]   70 
## [4] 1348 
## [5]   62 
## [6] 1306

In the case of red wine, mineral water comes first; however, it is the most frequently bought item overall, so this relation says little. The second item is spaghetti. White wine, which as a consequent was associated only with spaghetti, pancakes and cake, also shows its highest confidence for spaghetti. For champagne, the highest confidence is shown by chocolate.

Strength of the rule by lift:

Red Wine
plot(rules.redwine, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{red wine}" "{}"        
## Itemsets in Consequent (RHS)
##  [1] "{cookies}"           "{frozen vegetables}" "{frozen smoothie}"  
##  [4] "{shrimp}"            "{escalope}"          "{chocolate}"        
##  [7] "{milk}"              "{french fries}"      "{low fat yogurt}"   
## [10] "{tomatoes}"          "{green tea}"         "{grated cheese}"    
## [13] "{olive oil}"         "{pancakes}"          "{whole wheat rice}" 
## [16] "{eggs}"              "{cake}"              "{soup}"             
## [19] "{chicken}"           "{champagne}"         "{burgers}"          
## [22] "{cooking oil}"       "{mineral water}"     "{turkey}"           
## [25] "{ground beef}"       "{honey}"             "{spaghetti}"        
## [28] "{herb & pepper}"     "{hot dogs}"          "{salmon}"           
## [31] "{avocado}"           "{tomato juice}"      "{french wine}"      
## [34] "{ham}"               "{rice}"              "{pet food}"

White Wine
plot(rules.whitewine, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{white wine}" "{}"          
## Itemsets in Consequent (RHS)
##  [1] "{french fries}"      "{low fat yogurt}"    "{eggs}"             
##  [4] "{frozen vegetables}" "{cookies}"           "{soup}"             
##  [7] "{whole wheat rice}"  "{tomatoes}"          "{escalope}"         
## [10] "{burgers}"           "{turkey}"            "{cake}"             
## [13] "{mineral water}"     "{frozen smoothie}"   "{green tea}"        
## [16] "{chocolate}"         "{cooking oil}"       "{pancakes}"         
## [19] "{grated cheese}"     "{ground beef}"       "{spaghetti}"        
## [22] "{milk}"              "{shrimp}"            "{champagne}"        
## [25] "{olive oil}"         "{herb & pepper}"     "{chicken}"          
## [28] "{brownies}"          "{pepper}"            "{fresh bread}"

Champagne
plot(rules.champagne, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{champagne}" "{}"         
## Itemsets in Consequent (RHS)
##  [1] "{mineral water}"     "{shrimp}"            "{eggs}"             
##  [4] "{spaghetti}"         "{tomatoes}"          "{ground beef}"      
##  [7] "{milk}"              "{low fat yogurt}"    "{chicken}"          
## [10] "{burgers}"           "{cookies}"           "{turkey}"           
## [13] "{grated cheese}"     "{soup}"              "{cooking oil}"      
## [16] "{olive oil}"         "{cake}"              "{frozen vegetables}"
## [19] "{escalope}"          "{whole wheat rice}"  "{pancakes}"         
## [22] "{french fries}"      "{green tea}"         "{chocolate}"        
## [25] "{salmon}"            "{frozen smoothie}"   "{fresh bread}"

The next step is to check the significance of particular rules. Significance was tested with Fisher's exact test, which determines whether the association between two variables is non-random.

is.significant(rules.redwine, basket)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
is.significant(rules.whitewine, basket)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE
is.significant(rules.champagne, basket)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

Only a few rules were flagged as significant: four for red wine, and just one each for white wine and champagne.
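
To make the test concrete, the 2x2 table behind a single rule can be rebuilt from counts reported earlier. The sketch below does this for {red wine} => {spaghetti} (77 joint purchases, 211 red wine baskets, 1306 spaghetti baskets, 7501 baskets in total) and runs base R's `fisher.test()`. The one-sided alternative is an assumption made for illustration; the options that `is.significant()` applies internally (sidedness, significance level, multiple-testing adjustment) may differ, so the result need not match its flags.

```r
n         <- 7501   # all transactions
both      <- 77     # red wine and spaghetti together (count of the rule)
red_wine  <- 211    # transactions containing red wine
spaghetti <- 1306   # transactions containing spaghetti

# 2x2 contingency table: rows = red wine yes/no, columns = spaghetti yes/no
tab <- matrix(
  c(both,             red_wine - both,
    spaghetti - both, n - red_wine - spaghetti + both),
  nrow = 2, byrow = TRUE,
  dimnames = list(red_wine = c("yes", "no"), spaghetti = c("yes", "no"))
)

# One-sided test: is the overlap larger than expected under independence?
res <- fisher.test(tab, alternative = "greater")
res$p.value
```

A small p-value here indicates that the two items appear together more often than chance alone would suggest.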

3.2. ECLAT

The name ECLAT stands for (inhale) Equivalence Class Clustering and bottom-up Lattice Traversal. It is another algorithm used for association rule mining. In contrast to the Apriori algorithm, ECLAT works in a vertical manner, i.e. each item is stored together with the list of transactions that contain it, which often makes it faster and more scalable. With ECLAT only support is computed, because its output consists of item sets and their supports. As no rules are created at this stage, there is no need to calculate confidence.
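
The vertical idea can be sketched in a few lines of base R. This is only an illustration of the data layout on hypothetical toy transactions, not the implementation used by `eclat()`; the names `tidlists` and `itemset_support()` are made up for the example.

```r
# Toy transactions (hypothetical), indexed by transaction ID 1..5
toy <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "eggs"),
  c("bread", "butter"),
  c("bread", "eggs", "milk")
)

# Vertical layout: for each item, the IDs of the transactions that contain it
items    <- sort(unique(unlist(toy)))
tidlists <- lapply(items, function(i) {
  which(vapply(toy, function(tr) i %in% tr, logical(1)))
})
names(tidlists) <- items

# Support of an item set = size of the intersection of its items' ID lists
itemset_support <- function(itemset) {
  length(Reduce(intersect, tidlists[itemset])) / length(toy)
}

supp_2 <- itemset_support(c("bread", "milk"))
supp_3 <- itemset_support(c("bread", "butter", "milk"))
```

Because a larger item set is evaluated by intersecting ID lists that are already in memory, the transactions do not have to be scanned again for every candidate.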

items.basket <- eclat(basket, parameter=list(supp=0.005, maxlen=10)) 
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.005      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 37 
## 
## create itemset ... 
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.00s].
## sorting and recoding items ... [101 item(s)] done [0.00s].
## creating sparse bit matrix ... [101 row(s), 7501 column(s)] done [0.00s].
## writing  ... [725 set(s)] done [0.01s].
## Creating S4 object  ... done [0.00s].
inspect(head(items.basket))
##     items                                support     transIdenticalToItemsets
## [1] {mineral water,nonfat milk}          0.005065991 38                      
## [2] {pasta,shrimp}                       0.005065991 38                      
## [3] {escalope,pasta}                     0.005865885 44                      
## [4] {extra dark chocolate,mineral water} 0.005732569 43                      
## [5] {mineral water,mint}                 0.005865885 44                      
## [6] {black tea,mineral water}            0.005332622 40                      
##     count
## [1] 38   
## [2] 38   
## [3] 44   
## [4] 43   
## [5] 44   
## [6] 40

The ECLAT algorithm returns frequently bought bundles of products. With the minimum support set to 0.005, it found 725 item sets.

freq.rules <- ruleInduction(items.basket, basket, confidence=0.1)
freq.rules
## set of 1059 rules
inspect(head(freq.rules))
##     lhs                       rhs             support     confidence lift    
## [1] {nonfat milk}          => {mineral water} 0.005065991 0.4871795  2.043811
## [2] {pasta}                => {shrimp}        0.005065991 0.3220339  4.506672
## [3] {pasta}                => {escalope}      0.005865885 0.3728814  4.700812
## [4] {extra dark chocolate} => {mineral water} 0.005732569 0.4777778  2.004369
## [5] {mint}                 => {mineral water} 0.005865885 0.3358779  1.409072
## [6] {black tea}            => {mineral water} 0.005332622 0.3738318  1.568295
##     itemset
## [1] 1      
## [2] 2      
## [3] 3      
## [4] 4      
## [5] 5      
## [6] 6

Note that head() shows the first six rules rather than the strongest ones. Among them, the pair of nonfat milk and mineral water has the highest confidence: the probability of buying mineral water when nonfat milk is already in the basket is about 49%. The support of this pair is only about 0.5%, so the bundle itself is rare. Mineral water is the most frequently bought item, hence it is present in many bundles.

3.3. Similarity/Dissimilarity

The Jaccard index is a statistic used to compare sets. It measures the similarity between two sets and is defined as the size of their intersection divided by the size of their union. The dissimilarity() function used below reports the corresponding distance, i.e. one minus the index, so higher values mean less similar items.

trans.basket <- basket[,itemFrequency(basket)>0.1]
jac.index <- dissimilarity(trans.basket, which="items")
round(jac.index, 3)
##               chocolate  eggs french fries green tea  milk mineral water
## eggs              0.893                                                 
## french fries      0.885 0.884                                           
## green tea         0.914 0.911        0.896                              
## milk              0.877 0.889        0.914     0.928                    
## mineral water     0.849 0.861        0.910     0.908 0.850              
## spaghetti         0.869 0.885        0.913     0.905 0.868         0.831

According to the obtained Jaccard distances, the most dissimilar item pairs are green tea and milk (0.928), green tea and chocolate (0.914), french fries and milk (0.914), and french fries and spaghetti (0.913). The most similar pair is mineral water and spaghetti (0.831).
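
As a check on how these values arise, the 0.831 entry for mineral water and spaghetti in the matrix above can be reproduced by hand. The item counts come from the frequency listing; the joint count of 448 is not printed directly and is derived here from the coverage reported for {mineral water, spaghetti} in the Apriori output (0.05972537 × 7501).

```r
n_water     <- 1788  # transactions containing mineral water
n_spaghetti <- 1306  # transactions containing spaghetti
n_both      <- 448   # transactions containing both (derived from coverage)

# Jaccard index: intersection size divided by union size
jaccard_index    <- n_both / (n_water + n_spaghetti - n_both)
jaccard_distance <- 1 - jaccard_index

round(jaccard_distance, 3)
```

Even for this closest pair the two items share only about 17% of the baskets that contain either of them, which is why all distances in the matrix are high.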

plot(hclust(jac.index, method="ward.D2"), main="Dendrogram for items")

The dendrogram above helps visualise the concept of dissimilarity. The pairs indicated by the Jaccard distances are clearly visible on the graph.

4. Conclusions

Association rules are a powerful tool for discovering patterns and dependencies between items that may not be obvious at first. They help reveal actual consumer behaviour and adjust marketing strategies to achieve the best results. They may be helpful not only for maximizing sales but also for minimizing costs, e.g. by reducing the stock of items that do not pair well when they are out of season.

Another advantage of this kind of analysis would be increased customer satisfaction. Cross-selling is a popular technique, and it can also be combined with sales and special discounts.

Finally, the results offer a possibility to improve advertising and make it more relevant, which may also decrease spending and increase satisfaction.