Market Basket analysis
Data
Rules
- Top rules
- Induction
Dissimilarity measures
- Jaccard Index
- Affinity measure
Conclusions

Market Basket analysis

Market Basket Analysis is one of the association rule methods. Such a method produces rules of the form “if” and “then”. In simple words, if a client has a product in the basket, the method indicates which other products that client is most likely to buy. E.g. if a client buys a beer, then they are likely (say, with a probability of 90%) to also buy a bag of crisps.

Among all the orders we are looking for the most frequent rules (“if” and “then”). Those rules are based on indicators:

support is the share of all orders in which a specific set of products appears;
confidence indicates the strength of the rule: the share of orders containing the left-hand side (lhs) that also contain the right-hand side (rhs);
expected confidence is the support of the rhs, i.e. the number of orders containing the rhs divided by the number of all orders;
lift is confidence divided by expected confidence; it indicates how much more often the two sides occur together than they would if they were independent.
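
Written as formulas for a rule \(X \Rightarrow Y\), the standard definitions of these indicators are given below. The symbols \(N\) (total number of orders) and \(n(\cdot)\) (number of orders containing all listed items) are notation introduced here for illustration only.

\[ supp(X, Y) = \frac{n(X, Y)}{N}, \qquad conf(X \Rightarrow Y) = \frac{supp(X, Y)}{supp(X)} \]

\[ lift(X \Rightarrow Y) = \frac{conf(X \Rightarrow Y)}{supp(Y)} = \frac{supp(X, Y)}{supp(X)\,supp(Y)} \]

Here \(supp(Y)\) plays the role of the expected confidence. A lift of 1 corresponds to independent sides, and the last expression shows that lift is symmetric in \(X\) and \(Y\), while confidence is not.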

Data

This project is based on the Instacart dataset from the Kaggle competition https://www.kaggle.com/c/instacart-market-basket-analysis/data. It consists of 131209 orders and 39123 different products.

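library(arules); library(arulesViz)  # association rule mining and rule plots
library(dplyr); library(ggplot2)     # data manipulation and plotting used below
trans<-read.transactions("trans.csv", format = "single", sep=",",cols = c("order_id","product_name"))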
summary(trans)

## transactions as itemMatrix in sparse format with
##  131209 rows (elements/itemsets/transactions) and
##  39123 columns (items) and a density of 0.0002697329 
## 
## most frequent items:
##                 Banana Bag of Organic Bananas   Organic Strawberries 
##                  18726                  15480                  10894 
##   Organic Baby Spinach            Large Lemon                (Other) 
##                   9784                   8135                1321598 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 6845 7368 8033 8218 8895 8708 8541 7983 7217 6553 6034 5383 4843 4394 3831 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
## 3522 3108 2719 2473 2102 1857 1681 1462 1292 1079  986  860  679  634  553 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##  446  403  346  315  280  210  193  178  142   99   90   88   75   79   64 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   48   49   32   26   31   24   23   18   15   12   10    6    5    4    8 
##   61   62   63   64   65   66   67   68   70   72   74   75   76   77   80 
##    3    3    5    4    3    2    1    2    4    2    2    1    2    1    2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    5.00    9.00   10.55   14.00   80.00 
## 
## includes extended item information - examples:
##                         labels
## 1            #2 Coffee Filters
## 2 #2 Cone White Coffee Filters
## 3        #2 Mechanical Pencils
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2        100000
## 3       1000008
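
The density and the mean basket size in the summary can be cross-checked from the item counts alone. The snippet below is a small sketch that uses only figures copied from the output above; the variable names are illustrative.

```r
# Figures copied from the summary(trans) output above
n_orders   <- 131209
n_products <- 39123
# Counts of the five most frequent items plus the "(Other)" bucket
item_counts <- c(18726, 15480, 10894, 9784, 8135, 1321598)

total_items <- sum(item_counts)                 # filled cells of the sparse matrix
mean_basket <- total_items / n_orders           # average number of items per order
density     <- total_items / (n_orders * n_products)

print(c(total_items = total_items, mean_basket = mean_basket, density = density))
```

The mean basket size comes out at about 10.55 and the density at about 0.00027, in line with the summary.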

Basic information can be gained from the summary of the dataset. Firstly, the most frequent items are:

Banana
Bag of Organic Bananas
Organic Strawberries
Organic Baby Spinach
Large Lemon

What is more, the distribution of basket sizes can be analyzed, and a plot of that distribution is useful here.

The most frequent basket size is 5 and the mean size is 10.6.

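# Assumption: df is the long-format table behind trans.csv (one row per order_id / product_name pair)
df = read.csv("trans.csv")
group_basket = df %>% group_by(., order_id) %>% summarise(basket_size=n())
basket_sizes = group_basket %>% group_by(.,basket_size) %>% summarise(count=n())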

ggplot(basket_sizes, aes(x=basket_size, y=count)) + geom_bar(stat = "identity") + scale_x_continuous(breaks = seq(0, 80, by = 5))

As noted above, ‘Banana’ is the product that customers bought most often. The topN most frequent items can be plotted.

itemFrequencyPlot(trans,topN=20,type="absolute")

We know which products are the most popular, but it is harder to see how many products were bought only rarely. Counting the products that share the same number of purchases helps.

# One row per product (product names are kept as row names)
item_freq <- data.frame(number_of_purchases = itemFrequency(trans, type = "absolute"))
item_freq %>% group_by(.,number_of_purchases) %>% summarise(number_of_products = n()) %>% head(.,5)

## # A tibble: 5 x 2
##   number_of_purchases number_of_products
##                 <int>              <int>
## 1                   1               7884
## 2                   2               4910
## 3                   3               3291
## 4                   4               2441
## 5                   5               1815

The table shows that there are 7884 products that were bought only once, 4910 products that were bought twice and so on.
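
To put these counts in perspective, the share of rarely bought products can be computed from the five rows above and the number of products reported in the summary. This is a small sketch; the variable names are illustrative.

```r
# First five rows of the table above: products bought exactly 1, 2, ..., 5 times
number_of_products <- c(7884, 4910, 3291, 2441, 1815)
n_products <- 39123   # distinct products reported by summary(trans)

rare_products <- sum(number_of_products)     # products bought at most five times
rare_share    <- rare_products / n_products  # their share among all products

print(c(rare_products = rare_products, rare_share = round(rare_share, 3)))
```

About 52% of all products were bought five times or fewer, which is why very low support thresholds are needed later on.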

Rules

The rules are filtered by support and confidence, so minimum levels for both statistics have to be defined. Only then can the most frequent rules/patterns be analyzed.

Firstly, the most frequent item sets are found with the Eclat algorithm. The default support is 0.1, but in this dataset a lower value is required to obtain any results.

freq_items<-eclat(trans, parameter=list(supp=0.03, maxlen=15))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.03      1     15 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 3936 
## 
## create itemset ... 
## set transactions ...[39123 item(s), 131209 transaction(s)] done [1.23s].
## sorting and recoding items ... [17 item(s)] done [0.01s].
## creating sparse bit matrix ... [17 row(s), 131209 column(s)] done [0.02s].
## writing  ... [17 set(s)] done [0.01s].
## Creating S4 object  ... done [0.00s].

inspect(freq_items)

##      items                    support    count
## [1]  {Banana}                 0.14271887 18726
## [2]  {Bag of Organic Bananas} 0.11797971 15480
## [3]  {Organic Strawberries}   0.08302784 10894
## [4]  {Organic Baby Spinach}   0.07456806  9784
## [5]  {Large Lemon}            0.06200032  8135
## [6]  {Organic Hass Avocado}   0.05558308  7293
## [7]  {Organic Avocado}        0.05646716  7409
## [8]  {Limes}                  0.04598008  6033
## [9]  {Organic Raspberries}    0.04226844  5546
## [10] {Strawberries}           0.04949356  6494
## [11] {Organic Cucumber}       0.03515765  4613
## [12] {Organic Zucchini}       0.03497473  4589
## [13] {Organic Blueberries}    0.03784801  4966
## [14] {Organic Yellow Onion}   0.03269593  4290
## [15] {Organic Whole Milk}     0.03740597  4908
## [16] {Organic Garlic}         0.03168990  4158
## [17] {Seedless Red Grapes}    0.03093538  4059

The most frequent item sets all consist of a single item. With a minimum support of 0.03 there is no frequent item set in this dataset that contains two or more different items.

The next step is to find the most frequent rules. To obtain any rules, the support threshold needs to be lowered so that item sets of at least two items are found.
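
For a sense of scale, the relative support thresholds used here and in the following sections can be translated into numbers of orders. The sketch below only multiplies each threshold by the number of orders; the vector names are illustrative labels, not arules objects.

```r
n_orders <- 131209   # number of orders reported by summary(trans)

# Relative minimum supports used in this analysis (names are illustrative labels)
min_support <- c(eclat_first = 0.03, eclat_second = 0.001,
                 apriori_bananas = 0.0025, apriori_cola = 0.0001)

# Number of orders that each relative threshold corresponds to
min_orders <- min_support * n_orders
print(round(min_orders, 1))
```

A support of 0.03 corresponds to roughly 3936 orders, which is the "Absolute minimum support count" printed in the Eclat log, whereas 0.0001 corresponds to only about 13 orders.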

freq_items<-eclat(trans, parameter=list(supp=0.001, maxlen=15))

freq_rules<-ruleInduction(freq_items, trans, confidence=0.3)
summary(freq_rules)

## set of 347 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
##  65 267  15 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.856   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence          lift           itemset    
##  Min.   :0.001006   Min.   :0.3000   Min.   : 2.103   Min.   :   1  
##  1st Qu.:0.001158   1st Qu.:0.3212   1st Qu.: 2.682   1st Qu.:1152  
##  Median :0.001379   Median :0.3523   Median : 3.415   Median :1993  
##  Mean   :0.001850   Mean   :0.3675   Mean   : 5.734   Mean   :1677  
##  3rd Qu.:0.001741   3rd Qu.:0.4007   3rd Qu.: 4.355   3rd Qu.:2332  
##  Max.   :0.018444   Max.   :0.5984   Max.   :80.298   Max.   :2574  
## 
## mining info:
##   data ntransactions support confidence
##  trans        131209   0.001        0.3

There are 347 rules, of which 65 are of size 2 (lhs is one product and rhs is one product), 267 are of size 3 (lhs is two items) and 15 are of size 4 (lhs is three items). The mean support is 0.0019 and the mean confidence is 0.37. The average lift is 5.73.
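
As a quick consistency check, the mean rule length in the summary follows directly from the rule length distribution:

\[ \frac{2 \cdot 65 + 3 \cdot 267 + 4 \cdot 15}{347} = \frac{991}{347} \approx 2.856 \]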

Top rules

The rules with the highest lift value will be evaluated.

inspect(head(sort(freq_rules, by ="lift"),10))

##      lhs                                                       rhs                                                         support confidence     lift itemset
## [1]  {Strawberry Rhubarb Yoghurt}                           => {Blueberry Yoghurt}                                     0.001196564  0.3096647 80.29801      37
## [2]  {Blueberry Yoghurt}                                    => {Strawberry Rhubarb Yoghurt}                            0.001196564  0.3102767 80.29801      37
## [3]  {Nonfat Icelandic Style Strawberry Yogurt}             => {Icelandic Style Skyr Blueberry Non-fat Yogurt}         0.001166079  0.4226519 78.66062      12
## [4]  {Non Fat Acai & Mixed Berries Yogurt}                  => {Icelandic Style Skyr Blueberry Non-fat Yogurt}         0.001288021  0.4023810 74.88795      17
## [5]  {Icelandic Style Skyr Blueberry Non-fat Yogurt}        => {Non Fat Raspberry Yogurt}                              0.001676714  0.3120567 71.08447      67
## [6]  {Non Fat Raspberry Yogurt}                             => {Icelandic Style Skyr Blueberry Non-fat Yogurt}         0.001676714  0.3819444 71.08447      67
## [7]  {Lemon Sparkling Water}                                => {Grapefruit Sparkling Water}                            0.001097486  0.3130435 65.19702      10
## [8]  {Total 2% Lowfat Greek Strained Yogurt With Blueberry} => {Total 2% with Strawberry Lowfat Greek Strained Yogurt} 0.001783414  0.3616692 48.77108     135
## [9]  {Total 2% Lowfat Greek Strained Yogurt with Peach}     => {Total 2% with Strawberry Lowfat Greek Strained Yogurt} 0.001730064  0.3524845 47.53251     125
## [10] {Zero Calorie Cola}                                    => {Soda}                                                  0.001036514  0.3919308 34.12399       1

In the top 10 rules the lift is very high, but the support of every rule is far below 1% (between roughly 0.1% and 0.2%), so baskets containing all products of a given rule are rare. A confidence of 0.3-0.4 means that roughly 30-40% of the orders containing the lhs also contain the rhs. Support and confidence are therefore important here: the rules with the highest lift describe rare combinations of products.

All the rules can also be plotted in terms of support, confidence and lift.

plot(freq_rules, measure=c("support", "confidence"), shading="lift", interactive=FALSE)

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

The plot shows clearly that the majority of the rules have low support. As noted above, the rules with the highest lift have support far below 1%.

To get the rules that appear often in baskets, i.e. products that are frequently bought together, the rules are sorted by confidence and then by support, and the top 15 are shown.

inspect(head(sort(sort(freq_rules, by ="confidence"),by="support"),15))

##      lhs                                       rhs                          support confidence     lift itemset
## [1]  {Organic Hass Avocado}                 => {Bag of Organic Bananas} 0.018443857  0.3318250 2.812560    2560
## [2]  {Organic Raspberries}                  => {Bag of Organic Bananas} 0.013566143  0.3209520 2.720400    2500
## [3]  {Organic Raspberries}                  => {Organic Strawberries}   0.012727785  0.3011179 3.626710    2501
## [4]  {Honeycrisp Apple}                     => {Banana}                 0.009381978  0.3466629 2.428991    1996
## [5]  {Organic Fuji Apple}                   => {Banana}                 0.009221928  0.3715075 2.603072    1967
## [6]  {Organic Lemon}                        => {Bag of Organic Bananas} 0.008132064  0.3044223 2.580293    2067
## [7]  {Organic Large Extra Fancy Fuji Apple} => {Bag of Organic Bananas} 0.007415650  0.3365617 2.852709    1831
## [8]  {Broccoli Crown}                       => {Banana}                 0.007049821  0.3154843 2.210530    1858
## [9]  {Cucumber Kirby}                       => {Banana}                 0.005662721  0.3079155 2.157496    1299
## [10] {Organic Navel Orange}                 => {Bag of Organic Bananas} 0.005525536  0.3661616 3.103598    1068
## [11] {Blueberries}                          => {Banana}                 0.005456943  0.3082221 2.159645    1261
## [12] {Organic Hass Avocado,                                                                                    
##       Organic Strawberries}                 => {Bag of Organic Bananas} 0.005411214  0.4613385 3.910321    2558
## [13] {Apple Honeycrisp Organic}             => {Bag of Organic Bananas} 0.005235921  0.3050622 2.585717    1277
## [14] {Organic Kiwi}                         => {Bag of Organic Bananas} 0.004984414  0.3478723 2.948578    1140
## [15] {Organic Raspberries,                                                                                     
##       Organic Strawberries}                 => {Bag of Organic Bananas} 0.004946307  0.3886228 3.293980    2498
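
The quality measures of the first rule, {Organic Hass Avocado} => {Bag of Organic Bananas}, can be reproduced by hand from the item counts in the Eclat output and the rule support in the table above. This is a minimal sketch with illustrative variable names.

```r
n_orders      <- 131209
count_avocado <- 7293         # {Organic Hass Avocado}, from the Eclat output
count_bananas <- 15480        # {Bag of Organic Bananas}, from the Eclat output
supp_both     <- 0.018443857  # support of the rule, from the table above

supp_avocado <- count_avocado / n_orders
supp_bananas <- count_bananas / n_orders   # expected confidence of the rule

confidence <- supp_both / supp_avocado     # share of avocado orders that include the bananas
lift       <- confidence / supp_bananas    # confidence relative to expected confidence

print(round(c(confidence = confidence, lift = lift), 4))
```

The results agree with the reported confidence (0.3318) and lift (2.8126): customers who buy the avocado are almost three times as likely to buy the bag of bananas as an average customer.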

There are a few interesting rules. They suggest that buying one organic product often goes together with buying another organic product. The most frequent items appear in most of the rules. The rules can also be plotted as a matrix, with the lhs on the \(x\) axis and the rhs on the \(y\) axis. The redder the rectangle, the higher the lift of the rule.

rules_for_plot = head(sort(sort(freq_rules, by ="confidence"),by="support"),15)
plot(rules_for_plot, method="matrix", measure="lift")

## Itemsets in Antecedent (LHS)
##  [1] "{Organic Hass Avocado,Organic Strawberries}"
##  [2] "{Organic Raspberries,Organic Strawberries}" 
##  [3] "{Organic Raspberries}"                      
##  [4] "{Organic Navel Orange}"                     
##  [5] "{Organic Kiwi}"                             
##  [6] "{Organic Large Extra Fancy Fuji Apple}"     
##  [7] "{Organic Hass Avocado}"                     
##  [8] "{Organic Fuji Apple}"                       
##  [9] "{Apple Honeycrisp Organic}"                 
## [10] "{Organic Lemon}"                            
## [11] "{Honeycrisp Apple}"                         
## [12] "{Broccoli Crown}"                           
## [13] "{Blueberries}"                              
## [14] "{Cucumber Kirby}"                           
## Itemsets in Consequent (RHS)
## [1] "{Banana}"                 "{Bag of Organic Bananas}"
## [3] "{Organic Strawberries}"

Another way to plot the rules and make them easier to analyze is the parallel coordinates plot. It shows e.g. that a client who has ‘Organic Strawberries’ and ‘Organic Hass Avocado’ in the basket is likely to buy ‘Bag of Organic Bananas’.

plot(rules_for_plot, method="paracoord")

Induction

In this section, we will analyze which products lead customers to buy the two most frequent items, ‘Banana’ and ‘Bag of Organic Bananas’, and also ‘Zero Calorie Cola’.

Bananas rules

rules_banana <- apriori(data = trans, parameter = list(supp = 0.0025, conf = 0.3),
                        appearance = list(default = "lhs", rhs = "Banana"), control = list(verbose = FALSE))
inspect(sort(rules_banana, by = "lift"))

##      lhs                                       rhs      support     confidence lift     count
## [1]  {Bartlett Pears}                       => {Banana} 0.003551586 0.3860812  2.705187  466 
## [2]  {Gala Apples}                          => {Banana} 0.002804686 0.3837331  2.688734  368 
## [3]  {Organic Fuji Apple}                   => {Banana} 0.009221928 0.3715075  2.603072 1210 
## [4]  {Large Lemon,Organic Avocado}          => {Banana} 0.003635421 0.3535953  2.477565  477 
## [5]  {Organic Avocado,Organic Strawberries} => {Banana} 0.002888521 0.3483456  2.440782  379 
## [6]  {Honeycrisp Apple}                     => {Banana} 0.009381978 0.3466629  2.428991 1231 
## [7]  {Organic Avocado,Organic Baby Spinach} => {Banana} 0.003688771 0.3452211  2.418889  484 
## [8]  {Limes,Organic Avocado}                => {Banana} 0.002682743 0.3394407  2.378387  352 
## [9]  {Large Lemon,Organic Strawberries}     => {Banana} 0.002522693 0.3254671  2.280477  331 
## [10] {Broccoli Crown}                       => {Banana} 0.007049821 0.3154843  2.210530  925 
## [11] {Clementines, Bag}                     => {Banana} 0.003551586 0.3152909  2.209175  466 
## [12] {Granny Smith Apples}                  => {Banana} 0.003307700 0.3147208  2.205180  434 
## [13] {Blueberries}                          => {Banana} 0.005456943 0.3082221  2.159645  716 
## [14] {Cucumber Kirby}                       => {Banana} 0.005662721 0.3079155  2.157496  743

is.significant(rules_banana, trans)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

is.superset(rules_banana)

## 14 x 14 sparse Matrix of class "ngCMatrix"

##    [[ suppressing 14 column names '{Banana,Gala Apples}', '{Banana,Bartlett Pears}', '{Banana,Granny Smith Apples}' ... ]]

##                                                                          
## {Banana,Gala Apples}                          | . . . . . . . . . . . . .
## {Banana,Bartlett Pears}                       . | . . . . . . . . . . . .
## {Banana,Granny Smith Apples}                  . . | . . . . . . . . . . .
## {Banana,Clementines, Bag}                     . . . | . . . . . . . . . .
## {Banana,Blueberries}                          . . . . | . . . . . . . . .
## {Banana,Cucumber Kirby}                       . . . . . | . . . . . . . .
## {Banana,Broccoli Crown}                       . . . . . . | . . . . . . .
## {Banana,Organic Fuji Apple}                   . . . . . . . | . . . . . .
## {Banana,Honeycrisp Apple}                     . . . . . . . . | . . . . .
## {Banana,Limes,Organic Avocado}                . . . . . . . . . | . . . .
## {Banana,Large Lemon,Organic Avocado}          . . . . . . . . . . | . . .
## {Banana,Organic Avocado,Organic Baby Spinach} . . . . . . . . . . . | . .
## {Banana,Organic Avocado,Organic Strawberries} . . . . . . . . . . . . | .
## {Banana,Large Lemon,Organic Strawberries}     . . . . . . . . . . . . . |

# is.subset(rules_banana)

The rules with the highest lift for ‘Banana’ mostly have other fruits on the lhs. Importantly, all the rules are significant (the check is based on Fisher’s exact test) and no rule is a superset or subset of another, since the matrix has entries only on its diagonal. (There is no need to print both the superset and the subset matrix; both carry the same information about each pair of rules.)

‘Banana’ is the most popular item and it is mostly bought by people who generally buy fruit. The above rules can be plotted as a graph.
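
To illustrate what such a significance check does, the sketch below runs Fisher’s exact test from base R on the 2x2 table for {Bartlett Pears} => {Banana}. It is only an approximation of the idea: the options that arules applies internally (for example a multiple-comparison adjustment) may differ, and the number of orders with ‘Bartlett Pears’ is not reported above, so it is back-computed from the rule’s count and confidence (an assumption).

```r
n_orders     <- 131209
count_banana <- 18726   # {Banana}, from the Eclat output
count_both   <- 466     # "count" column of {Bartlett Pears} => {Banana}
# Assumption: lhs count recovered as count / confidence, rounded to a whole order
count_pears  <- round(count_both / 0.3860812)

# Rows: pears in the order (yes/no); columns: banana in the order (yes/no)
contingency <- matrix(
  c(count_both,                count_pears - count_both,
    count_banana - count_both, n_orders - count_pears - count_banana + count_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(pears = c("yes", "no"), banana = c("yes", "no"))
)

# One-sided test: are the two items bought together more often than by chance?
result <- fisher.test(contingency, alternative = "greater")
print(result$p.value)
```

A p-value far below the usual 0.01 level indicates that the co-occurrence is very unlikely to be a coincidence.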

plot(rules_banana, method="graph",control = list(cex=0.9))

Bag of Organic Bananas rules

rules_bag_banana <- apriori(data = trans, parameter = list(supp = 0.0025, conf = 0.3),
                            appearance = list(default = "lhs", rhs = "Bag of Organic Bananas"), control = list(verbose = FALSE))
inspect(sort(rules_bag_banana, by = "lift"))

##      lhs                                            rhs                      support     confidence lift     count
## [1]  {Organic Hass Avocado,Organic Raspberries}  => {Bag of Organic Bananas} 0.004046978 0.5210991  4.416854  531 
## [2]  {Organic Hass Avocado,Organic Strawberries} => {Bag of Organic Bananas} 0.005411214 0.4613385  3.910321  710 
## [3]  {Organic Hass Avocado,Organic Lemon}        => {Bag of Organic Bananas} 0.002690364 0.4519846  3.831037  353 
## [4]  {Organic Cucumber,Organic Hass Avocado}     => {Bag of Organic Bananas} 0.002789443 0.4404332  3.733127  366 
## [5]  {Organic Cucumber,Organic Strawberries}     => {Bag of Organic Bananas} 0.003231486 0.4108527  3.482401  424 
## [6]  {Organic Baby Spinach,Organic Hass Avocado} => {Bag of Organic Bananas} 0.003787850 0.3969649  3.364687  497 
## [7]  {Organic Raspberries,Organic Strawberries}  => {Bag of Organic Bananas} 0.004946307 0.3886228  3.293980  649 
## [8]  {Organic Navel Orange}                      => {Bag of Organic Bananas} 0.005525536 0.3661616  3.103598  725 
## [9]  {Organic Baby Spinach,Organic Strawberries} => {Bag of Organic Bananas} 0.004473778 0.3581452  3.035651  587 
## [10] {Organic Strawberries,Organic Whole Milk}   => {Bag of Organic Bananas} 0.002583664 0.3553459  3.011924  339 
## [11] {Organic Kiwi}                              => {Bag of Organic Bananas} 0.004984414 0.3478723  2.948578  654 
## [12] {Organic Bartlett Pear}                     => {Bag of Organic Bananas} 0.002515071 0.3367347  2.854175  330 
## [13] {Organic Large Extra Fancy Fuji Apple}      => {Bag of Organic Bananas} 0.007415650 0.3365617  2.852709  973 
## [14] {Organic Hass Avocado}                      => {Bag of Organic Bananas} 0.018443857 0.3318250  2.812560 2420 
## [15] {Organic Raspberries}                       => {Bag of Organic Bananas} 0.013566143 0.3209520  2.720400 1780 
## [16] {Frozen Organic Wild Blueberries}           => {Bag of Organic Bananas} 0.002591286 0.3192488  2.705964  340 
## [17] {Organic D'Anjou Pears}                     => {Bag of Organic Bananas} 0.004603343 0.3190703  2.704450  604 
## [18] {Organic Whole Strawberries}                => {Bag of Organic Bananas} 0.002941871 0.3182193  2.697237  386 
## [19] {Organic Broccoli}                          => {Bag of Organic Bananas} 0.004001250 0.3176044  2.692025  525 
## [20] {Apple Honeycrisp Organic}                  => {Bag of Organic Bananas} 0.005235921 0.3050622  2.585717  687 
## [21] {Organic Lemon}                             => {Bag of Organic Bananas} 0.008132064 0.3044223  2.580293 1067

is.significant(rules_bag_banana, trans)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

is.superset(rules_bag_banana)

## 21 x 21 sparse Matrix of class "ngCMatrix"

##    [[ suppressing 21 column names '{Bag of Organic Bananas,Organic Bartlett Pear}', '{Bag of Organic Bananas,Frozen Organic Wild Blueberries}', '{Bag of Organic Bananas,Organic Whole Strawberries}' ... ]]

##                                                                                                             
## {Bag of Organic Bananas,Organic Bartlett Pear}                     | . . . . . . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Frozen Organic Wild Blueberries}           . | . . . . . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Whole Strawberries}                . . | . . . . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Broccoli}                          . . . | . . . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Navel Orange}                      . . . . | . . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Kiwi}                              . . . . . | . . . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic D'Anjou Pears}                     . . . . . . | . . . . . . . . . . . . . .
## {Apple Honeycrisp Organic,Bag of Organic Bananas}                  . . . . . . . | . . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Large Extra Fancy Fuji Apple}      . . . . . . . . | . . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Lemon}                             . . . . . . . . . | . . . . . . . . . . .
## {Bag of Organic Bananas,Organic Raspberries}                       . . . . . . . . . . | . . . . . . . . . .
## {Bag of Organic Bananas,Organic Hass Avocado}                      . . . . . . . . . . . | . . . . . . . . .
## {Bag of Organic Bananas,Organic Hass Avocado,Organic Lemon}        . . . . . . . . . | . | | . . . . . . . .
## {Bag of Organic Bananas,Organic Strawberries,Organic Whole Milk}   . . . . . . . . . . . . . | . . . . . . .
## {Bag of Organic Bananas,Organic Cucumber,Organic Hass Avocado}     . . . . . . . . . . . | . . | . . . . . .
## {Bag of Organic Bananas,Organic Cucumber,Organic Strawberries}     . . . . . . . . . . . . . . . | . . . . .
## {Bag of Organic Bananas,Organic Hass Avocado,Organic Raspberries}  . . . . . . . . . . | | . . . . | . . . .
## {Bag of Organic Bananas,Organic Raspberries,Organic Strawberries}  . . . . . . . . . . | . . . . . . | . . .
## {Bag of Organic Bananas,Organic Baby Spinach,Organic Hass Avocado} . . . . . . . . . . . | . . . . . . | . .
## {Bag of Organic Bananas,Organic Hass Avocado,Organic Strawberries} . . . . . . . . . . . | . . . . . . . | .
## {Bag of Organic Bananas,Organic Baby Spinach,Organic Strawberries} . . . . . . . . . . . . . . . . . . . . |

In this case, we are looking for products that lead to buying ‘Bag of Organic Bananas’. The most common products on the lhs are organic products (‘Organic Hass Avocado’, ‘Organic Raspberries’, ‘Organic Strawberries’) and, more generally, vegetables and fruits.

All the rules are significant and some of them are supersets or subsets of others.

plot(rules_bag_banana, method="graph",control = list(cex=0.6))

Zero Calorie Cola rules

The last product to be analyzed is not as popular as the previous ones, but out of curiosity let’s check what leads to buying something less healthy: ‘Zero Calorie Cola’.

rules_cola <- apriori(data = trans, parameter = list(supp = 0.0001, conf = 0.01),
                      appearance = list(default = "lhs", rhs = "Zero Calorie Cola"), control = list(verbose = FALSE))
inspect(sort(rules_cola, by = "lift"))

##      lhs                                     rhs                 support      confidence lift      count
## [1]  {0% Greek Strained Yogurt,Soda}      => {Zero Calorie Cola} 0.0001067000 0.23728814 89.724320  14  
## [2]  {Soda,Trail Mix}                     => {Zero Calorie Cola} 0.0001448071 0.21839080 82.578787  19  
## [3]  {Soda}                               => {Zero Calorie Cola} 0.0010365143 0.09024552 34.123990 136  
## [4]  {Trail Mix}                          => {Zero Calorie Cola} 0.0002515071 0.06191370 23.411049  33  
## [5]  {Milk Chocolate Almonds}             => {Zero Calorie Cola} 0.0001143214 0.05791506 21.899069  15  
## [6]  {Popcorn}                            => {Zero Calorie Cola} 0.0001143214 0.05494505 20.776040  15  
## [7]  {Crunchy Oats 'n Honey Granola Bars} => {Zero Calorie Cola} 0.0001371857 0.05013928 18.958859  18  
## [8]  {Mineral Water}                      => {Zero Calorie Cola} 0.0001295643 0.04941860 18.686356  17  
## [9]  {Apples}                             => {Zero Calorie Cola} 0.0001143214 0.04870130 18.415126  15  
## [10] {0% Greek Strained Yogurt}           => {Zero Calorie Cola} 0.0001371857 0.04358354 16.479977  18  
## [11] {Mixed Fruit Fruit Snacks}           => {Zero Calorie Cola} 0.0001600500 0.04294479 16.238451  21  
## [12] {Sparkling Mineral Water}            => {Zero Calorie Cola} 0.0001448071 0.02769679 10.472820  19  
## [13] {Sparkling Water}                    => {Zero Calorie Cola} 0.0001219429 0.02377415  8.989573  16  
## [14] {Clementines}                        => {Zero Calorie Cola} 0.0002057786 0.01998520  7.556881  27  
## [15] {Hass Avocados}                      => {Zero Calorie Cola} 0.0001752929 0.01010545  3.821112  23

is.significant(rules_cola, trans)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

is.superset(rules_cola)

## 15 x 15 sparse Matrix of class "ngCMatrix"

##    [[ suppressing 15 column names '{Popcorn,Zero Calorie Cola}', '{Milk Chocolate Almonds,Zero Calorie Cola}', '{Mineral Water,Zero Calorie Cola}' ... ]]

##                                                                                     
## {Popcorn,Zero Calorie Cola}                            | . . . . . . . . . . . . . .
## {Milk Chocolate Almonds,Zero Calorie Cola}             . | . . . . . . . . . . . . .
## {Mineral Water,Zero Calorie Cola}                      . . | . . . . . . . . . . . .
## {Apples,Zero Calorie Cola}                             . . . | . . . . . . . . . . .
## {Crunchy Oats 'n Honey Granola Bars,Zero Calorie Cola} . . . . | . . . . . . . . . .
## {0% Greek Strained Yogurt,Zero Calorie Cola}           . . . . . | . . . . . . . . .
## {Trail Mix,Zero Calorie Cola}                          . . . . . . | . . . . . . . .
## {Sparkling Water,Zero Calorie Cola}                    . . . . . . . | . . . . . . .
## {Mixed Fruit Fruit Snacks,Zero Calorie Cola}           . . . . . . . . | . . . . . .
## {Sparkling Mineral Water,Zero Calorie Cola}            . . . . . . . . . | . . . . .
## {Clementines,Zero Calorie Cola}                        . . . . . . . . . . | . . . .
## {Soda,Zero Calorie Cola}                               . . . . . . . . . . . | . . .
## {Hass Avocados,Zero Calorie Cola}                      . . . . . . . . . . . . | . .
## {0% Greek Strained Yogurt,Soda,Zero Calorie Cola}      . . . . . | . . . . . | . | .
## {Soda,Trail Mix,Zero Calorie Cola}                     . . . . . . | . . . . | . . |

The items most strongly associated with ‘Zero Calorie Cola’ are ‘Soda’ and snacks such as ‘Trail Mix’, i.e. generally less healthy food. The basket consisting of ‘0% Greek Strained Yogurt’, ‘Soda’ and ‘Zero Calorie Cola’ is not frequent (14 orders), but its lift is very high (89.7).

Moreover, all the rules are significant and some of the rules are supersets/subsets of others.
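
The pair ‘Soda’ and ‘Zero Calorie Cola’ appears in both directions in this report ({Zero Calorie Cola} => {Soda} among the top rules and {Soda} => {Zero Calorie Cola} here), which makes it a handy illustration of the difference between confidence and lift. The sketch below uses only values copied from the two tables; the single-item supports are back-computed from them, and the variable names are illustrative.

```r
n_orders   <- 131209
count_pair <- 136                # "count" of {Soda} => {Zero Calorie Cola}
conf_soda_to_cola <- 0.09024552  # confidence of {Soda} => {Zero Calorie Cola}
conf_cola_to_soda <- 0.3919308   # confidence of {Zero Calorie Cola} => {Soda}

supp_pair <- count_pair / n_orders
# Single-item supports recovered from conf = supp(pair) / supp(lhs)
supp_soda <- supp_pair / conf_soda_to_cola
supp_cola <- supp_pair / conf_cola_to_soda

# Lift is the same in both directions
lift <- supp_pair / (supp_soda * supp_cola)
print(round(c(supp_soda = supp_soda, supp_cola = supp_cola, lift = lift), 4))
```

Confidence depends on the direction of the rule (about 9% versus 39%) because ‘Soda’ is bought far more often than ‘Zero Calorie Cola’, while the lift of about 34.12 is identical for both rules.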

plot(rules_cola, method="graph",control = list(cex=0.7))

Dissimilarity measures

The basic measures connected with Market Basket Analysis (support, confidence, lift) were calculated and the results were shown on plots. In addition, there are other measures that can be computed to get deeper knowledge of the data:

Jaccard Index,
Affinity measure.

Both measures will be calculated for the most frequent items (those with support above 0.06).

Jaccard Index

The Jaccard index shows how likely it is that two products will be bought together. It can be written as an equation.

Jaccard coefficient (similarity): \[J(X,Y) = \frac{|X\cap Y|}{|X\cup Y|}\]

Jaccard distance (dissimilarity) is \(1 - J(X,Y)\):

\[ d_j(X,Y) = 1 - J(X,Y) = \frac{|X\cup Y|-|X\cap Y|}{|X\cup Y|}\]

trans.sel<-trans[,itemFrequency(trans)>0.06]
jac<-dissimilarity(trans.sel, which="items") 
round(jac,digits=3)

##                      Bag of Organic Bananas Banana Large Lemon Organic Baby Spinach
## Banana                                0.999                                        
## Large Lemon                           0.953  0.913                                 
## Organic Baby Spinach                  0.903  0.925       0.926                     
## Organic Strawberries                  0.868  0.921       0.944                0.914

The results show that ‘Banana’ and ‘Bag of Organic Bananas’ have a Jaccard distance of 0.999: they almost never appear in the same order, which makes them the most dissimilar pair. This is logical, since there is little reason to buy the two products together as they are substitutes.

What is more, a dendrogram can be built from the Jaccard distances. In this way, the most similar pairs of products are easily visible.

plot(hclust(jac, method = "ward.D2"), main = "Dendrogram for items")

Affinity measure

In contrast to the Jaccard distance, affinity is a measure of the similarity of two items and can be represented as:

\[A(i,j) = \frac{supp(i, j)}{supp(i)+supp(j)-supp(i, j)}\]

The higher the value, the more likely two products will be bought together.

a = affinity(trans.sel)
round(a, digits=3)

## An object of class "ar_similarity"
##                        Bag of Organic Bananas Banana Large Lemon Organic Baby Spinach Organic Strawberries
## Bag of Organic Bananas                  0.000  0.001       0.047                0.097                0.132
## Banana                                  0.001  0.000       0.087                0.075                0.079
## Large Lemon                             0.047  0.087       0.000                0.074                0.056
## Organic Baby Spinach                    0.097  0.075       0.074                0.000                0.086
## Organic Strawberries                    0.132  0.079       0.056                0.086                0.000
## Slot "method":
## [1] "Affinity"

par(mar=c(4,8,4,4))
image(a, axes=FALSE)
axis(1,at=seq(0,1,l=ncol(a)),labels=rownames(a),cex.axis=0.5)
axis(2,at=seq(0,1,l=ncol(a)),labels=rownames(a),cex.axis=0.6, las=2)

The results are as expected: mostly low values. The results can also be plotted, but compared with the matrix the plot is flipped, so the diagonal of zeros does not run from top left to bottom right but from bottom left to top right.

The red rectangles indicate that the value of the affinity measure is really low. In this case ‘Banana’ and ‘Bag of Organic Bananas’ have the lowest value. Because the affinity formula is the same expression as the Jaccard coefficient, the affinity measure is basically \(1 - d_j\), i.e. one minus the Jaccard distance.
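
The relation between affinity and the Jaccard distance can be checked on a small example in base R. The data below are hypothetical (six made-up orders with three products A, B and C) and all names are illustrative. Unlike the arules output above, this sketch does not set the diagonal of the affinity matrix to zero.

```r
# Hypothetical toy data: 6 orders x 3 products (1 = product is in the order)
baskets <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 0,
    0, 1, 0,
    1, 0, 0,
    0, 1, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C"))
)
n <- nrow(baskets)

supp      <- colMeans(baskets)        # single-item supports supp(i)
supp_pair <- crossprod(baskets) / n   # pairwise supports supp(i, j)

# Affinity: supp(i, j) / (supp(i) + supp(j) - supp(i, j))
aff <- supp_pair / (outer(supp, supp, "+") - supp_pair)

# Jaccard distance from set sizes: 1 - |intersection| / |union|
both   <- crossprod(baskets)          # orders containing both items
either <- n - crossprod(1 - baskets)  # orders containing at least one of the items
jaccard_dist <- 1 - both / either

print(round(aff, 3))
print(round(jaccard_dist, 3))
```

For every pair the two matrices add up to one, which is the same pattern as in the results above (e.g. 0.132 and 0.868 for ‘Organic Strawberries’ and ‘Bag of Organic Bananas’).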

Conclusions

To sum up, the analysis above can be used for better placement of products. At the beginning, the strongest rules were discovered, but those baskets were rather unusual cases. The analysis was also performed on three products:

‘Banana’ and ‘Bag of Organic Bananas’ are the two most frequent items. The rules show that those two products are mostly bought with other fruits and vegetables (organic ones in the case of ‘Bag of Organic Bananas’). This suggests that those products should be placed very close to other fruits and vegetables in the shop.
‘Zero Calorie Cola’ was mostly bought with some kind of snacks. It suggests that the shop can offer e.g. bundles of cola and crisps in order to sell more of those products.

Market Basket Analysis is a powerful tool to get better knowledge about customers’ behaviour. It can help shops to increase cross-sell and be more profitable.

Market Basket Analysis on Instacart dataset

Krystian Andruszek