Market Basket Analysis

Introduction

Association rule mining is a machine learning method that helps uncover relationships between variables. Its aim is to show how frequently items appear together in transactions. It can help predict client behaviour: if a client bought item A, which product will they choose next? Applied to shopping baskets, this method is called Market Basket Analysis. In this paper, a Groceries Dataset found on Kaggle is used.

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Dataset

In order to work on the data, we first need to load the dataset.

df = read.csv('Groceries_dataset.csv', row.names=NULL, sep=",")
head(df)
##   Member_number       Date  itemDescription
## 1          1808 21-07-2015   tropical fruit
## 2          2552 05-01-2015       whole milk
## 3          2300 19-09-2015        pip fruit
## 4          1187 12-12-2015 other vegetables
## 5          3037 01-02-2015       whole milk
## 6          4941 14-02-2015       rolls/buns
str(df)
## 'data.frame':    38765 obs. of  3 variables:
##  $ Member_number  : int  1808 2552 2300 1187 3037 4941 4501 3803 2762 4119 ...
##  $ Date           : chr  "21-07-2015" "05-01-2015" "19-09-2015" "12-12-2015" ...
##  $ itemDescription: chr  "tropical fruit" "whole milk" "pip fruit" "other vegetables" ...

The dataset has 38,765 observations of 3 variables:

* Member_number - unique ID of a client
* Date - date of purchase
* itemDescription - category of item purchased

summary(df)
##  Member_number      Date           itemDescription   
##  Min.   :1000   Length:38765       Length:38765      
##  1st Qu.:2002   Class :character   Class :character  
##  Median :3005   Mode  :character   Mode  :character  
##  Mean   :3004                                        
##  3rd Qu.:4007                                        
##  Max.   :5000

Because the Date variable is stored as character strings, it needs to be converted to the Date class.

df$Date <-as.Date(df$Date, format="%d-%m-%Y")
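
The format string matters here because the dates are written day first. The following is a minimal base R sketch on two values copied from the head(df) output; the variable names raw_dates, parsed and wrong are illustrative and not part of the analysis.

```r
# Two date strings copied from the head(df) output (day-month-year)
raw_dates <- c("21-07-2015", "05-01-2015")

# "%d-%m-%Y" tells as.Date that the day comes first and the year has 4 digits
parsed <- as.Date(raw_dates, format = "%d-%m-%Y")
print(parsed)

# A mismatched format (month first) cannot parse "21" as a month and yields NA
wrong <- as.Date(raw_dates[1], format = "%m-%d-%Y")
print(wrong)
```

Once parsed, the values compare chronologically rather than alphabetically, which is what grouping by purchase date relies on.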

To perform market basket analysis, the dataset has to be converted into a transactions object. Grouping the data by Member_number and Date creates one transaction per client per shopping day. Grouping by Member_number alone would mix multiple purchases by the same client together.
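
To see why the date is part of the grouping key, here is a small base R sketch on made-up rows (the toy data frame and its values are hypothetical; only the column names follow the dataset). It is an illustration of the idea, not a replacement for the dplyr pipeline used in this paper.

```r
# Hypothetical toy data: member 1 shops on two different days
toy <- data.frame(
  Member_number   = c(1, 1, 1, 2),
  Date            = as.Date(c("2015-01-05", "2015-01-05", "2015-03-10", "2015-01-05")),
  itemDescription = c("whole milk", "yogurt", "soda", "whole milk"),
  stringsAsFactors = FALSE
)

# Grouping by member only: both visits of member 1 are merged into one basket
by_member <- split(toy$itemDescription, toy$Member_number)

# Grouping by member and date: each visit becomes its own basket
visit_key <- paste(toy$Member_number, toy$Date)
by_visit  <- split(toy$itemDescription, visit_key)

print(by_member)
print(by_visit)
```

With the member-only key, soda ends up in the same basket as whole milk and yogurt even though it was bought two months later.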

transactions_list <- df %>%
  group_by(Member_number, Date) %>%
  summarise(items = list(itemDescription)) %>%
  ungroup() %>% 
  .$items
## `summarise()` has grouped output by 'Member_number'. You can override using the
## `.groups` argument.
transactions<- as(transactions_list, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
summary(transactions)
## transactions as itemMatrix in sparse format with
##  14963 rows (elements/itemsets/transactions) and
##  167 columns (items) and a density of 0.01520957 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2363             1827             1646             1453 
##           yogurt          (Other) 
##             1285            29432 
## 
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10 
##   205 10012  2727  1273   338   179   113    96    19     1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    2.00    2.54    3.00   10.00 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

From the summary above we can gain basic information, such as the most frequently bought items: whole milk, other vegetables, rolls/buns, soda and yogurt.

To gather more information about the transactions, we can create a frequency plot showing the 20 most frequently bought products.
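
The figures in the summary are linked to each other. As a rough check, the sketch below rebuilds the number of transactions, the mean basket size and the density from the basket-size distribution printed above; the numbers are copied from that output and the variable names are illustrative.

```r
# Basket-size distribution copied from the summary(transactions) output
sizes   <- 1:10
n_tx    <- c(205, 10012, 2727, 1273, 338, 179, 113, 96, 19, 1)
n_items <- 167   # number of distinct items (columns)

n_transactions <- sum(n_tx)            # rows of the item matrix
total_items    <- sum(sizes * n_tx)    # filled cells of the item matrix

# Mean basket size and density (share of filled cells)
mean_size <- total_items / n_transactions
density   <- total_items / (n_transactions * n_items)

print(c(transactions = n_transactions, mean_size = mean_size, density = density))
```

A density of about 1.5% says that a typical basket contains only a tiny fraction of the 167 available items, which is why low support thresholds are needed later on.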

itemFrequencyPlot(transactions, topN=20, type="absolute")

The frequency plot shows the most frequently bought products, but we can also explore the products that were least popular:

# Absolute purchase counts per item, one row per product
item_freq <- as.data.frame(itemFrequency(transactions, type = "absolute"))

colnames(item_freq) <- 'nb_of_purchases'

item_freq$product_names <- names(itemFrequency(transactions, type = "absolute"))


item_freq %>% 
  group_by(.,nb_of_purchases) %>% 
  summarise(
    nb_of_products = n(),
    product_names = paste(product_names, collapse=",")
    ) %>% 
  head(.,5)
## # A tibble: 5 × 3
##   nb_of_purchases nb_of_products product_names                                  
##             <int>          <int> <chr>                                          
## 1               1              2 kitchen utensil,preservation products          
## 2               3              1 baby cosmetics                                 
## 3               4              1 bags                                           
## 4               5              4 frozen chicken,make up remover,rubbing alcohol…
## 5               6              1 salad dressing

Two products were bought only once: kitchen utensil and preservation products. Baby cosmetics were bought 3 times, while bags were bought 4 times.
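
The same kind of grouping can be sketched in base R. The example below uses only the four counts quoted above and is an illustration of the idea, not the code used to produce the table.

```r
# Purchase counts copied from the table above (first three rows)
counts <- c("kitchen utensil" = 1, "preservation products" = 1,
            "baby cosmetics" = 3, "bags" = 4)

# Collect product names that share the same number of purchases
by_count <- sapply(split(names(counts), counts), paste, collapse = ",")

print(by_count)
```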

Rules

Association rules are based on a few key concepts. The first is support, the proportion of transactions in the dataset that contain a particular item or itemset. The higher the support, the more often the itemset appears in the dataset. The second is confidence, the probability that item B appears in a transaction given that item A is already in the basket. High confidence means a strong likelihood that B appears when A does. Lastly, lift measures the strength of the association by comparing the confidence of the rule with the overall support of B. Lift below 1 means that the items are negatively associated, while lift above 1 means a positive association. If the items are independent, lift equals 1.
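
The three measures can be computed by hand on a tiny example. The sketch below uses five made-up baskets (hypothetical data, not taken from the Groceries dataset) and base R only, for the rule {milk} => {bread}; the helper names has and support are illustrative.

```r
# Five hypothetical baskets
baskets <- list(
  c("milk", "bread"),
  c("milk", "yogurt"),
  c("bread", "butter"),
  c("milk", "bread", "butter"),
  c("yogurt")
)

# TRUE for each basket that contains all of the given items
has <- function(items) vapply(baskets, function(b) all(items %in% b), logical(1))

# Support: share of baskets containing the itemset
support <- function(items) mean(has(items))

supp_milk  <- support("milk")               # 3 of 5 baskets
supp_bread <- support("bread")              # 3 of 5 baskets
supp_both  <- support(c("milk", "bread"))   # 2 of 5 baskets

# Confidence of {milk} => {bread}: share of milk baskets that also hold bread
conf_rule <- supp_both / supp_milk

# Lift: confidence relative to the overall support of bread
lift_rule <- conf_rule / supp_bread

print(c(support = supp_both, confidence = conf_rule, lift = lift_rule))
```

Here the lift is slightly above 1, so bread appears a little more often in baskets with milk than in baskets overall.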

We have to set thresholds for these measures to be able to mine rules and patterns. By default, the minimum support in eclat is 0.1.

freq_items<-eclat(transactions, parameter=list(supp=0.1, maxlen=10))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE     0.1      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 1496 
## 
## create itemset ... 
## set transactions ...[167 item(s), 14963 transaction(s)] done [0.00s].
## sorting and recoding items ... [3 item(s)] done [0.00s].
## creating sparse bit matrix ... [3 row(s), 14963 column(s)] done [0.00s].
## writing  ... [3 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
inspect(freq_items)
##     items              support   count
## [1] {whole milk}       0.1579229 2363 
## [2] {other vegetables} 0.1221012 1827 
## [3] {rolls/buns}       0.1100047 1646

With support set to 0.1, only 3 frequently bought items are returned. To get more results, we need to lower the support threshold.
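
A support threshold is easier to judge when it is translated into a number of transactions. The sketch below does this for the three thresholds used in this paper; rounding down happens to match the "Absolute minimum support count" lines in the eclat output, although how arules computes that figure internally is an assumption here. The item counts are copied from summary(transactions).

```r
n_transactions <- 14963

# Support thresholds used in this paper and the matching transaction counts
thresholds <- c(0.1, 0.05, 0.001)
min_counts <- floor(thresholds * n_transactions)
print(data.frame(support = thresholds, min_count = min_counts))

# Counts of the five most frequent items, copied from summary(transactions)
top_counts <- c("whole milk" = 2363, "other vegetables" = 1827,
                "rolls/buns" = 1646, "soda" = 1453, "yogurt" = 1285)

# Items whose relative frequency reaches the 0.1 threshold
passes <- top_counts / n_transactions >= 0.1
print(names(top_counts)[passes])
```

Soda, with 1453 purchases, falls just short of the 10% mark, which is why only three items survive the first run.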

freq_items<-eclat(transactions, parameter=list(supp=0.05, maxlen=10))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.05      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 748 
## 
## create itemset ... 
## set transactions ...[167 item(s), 14963 transaction(s)] done [0.00s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating sparse bit matrix ... [11 row(s), 14963 column(s)] done [0.00s].
## writing  ... [11 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
inspect(freq_items)
##      items              support    count
## [1]  {whole milk}       0.15792288 2363 
## [2]  {other vegetables} 0.12210118 1827 
## [3]  {rolls/buns}       0.11000468 1646 
## [4]  {soda}             0.09710620 1453 
## [5]  {yogurt}           0.08587850 1285 
## [6]  {tropical fruit}   0.06776716 1014 
## [7]  {root vegetables}  0.06957161 1041 
## [8]  {sausage}          0.06034886  903 
## [9]  {bottled water}    0.06068302  908 
## [10] {citrus fruit}     0.05313106  795 
## [11] {pastry}           0.05172759  774

However, all of the frequent itemsets contain a single item. To get itemsets of two or more items, the support needs to be set even lower.

freq_items<-eclat(transactions, parameter=list(supp=0.001, maxlen=10))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE   0.001      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 14 
## 
## create itemset ... 
## set transactions ...[167 item(s), 14963 transaction(s)] done [0.01s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating sparse bit matrix ... [149 row(s), 14963 column(s)] done [0.00s].
## writing  ... [750 set(s)] done [0.02s].
## Creating S4 object  ... done [0.00s].
freq_rules<- ruleInduction(freq_items, transactions, confidence=0.1)
summary(freq_rules)
## set of 131 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3 
## 114  17 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.00    2.00    2.13    2.00    3.00 
## 
## summary of quality measures:
##     support           confidence          lift           itemset     
##  Min.   :0.001002   Min.   :0.1000   Min.   :0.6458   Min.   :  1.0  
##  1st Qu.:0.001337   1st Qu.:0.1098   1st Qu.:0.8074   1st Qu.: 42.0  
##  Median :0.001938   Median :0.1215   Median :0.8795   Median :164.0  
##  Mean   :0.002933   Mean   :0.1257   Mean   :0.9464   Mean   :247.8  
##  3rd Qu.:0.003776   3rd Qu.:0.1347   3rd Qu.:1.0319   3rd Qu.:495.5  
##  Max.   :0.014837   Max.   :0.2558   Max.   :2.1829   Max.   :601.0  
## 
## mining info:
##          data ntransactions support
##  transactions         14963   0.001
##                                                                     call
##  eclat(data = transactions, parameter = list(supp = 0.001, maxlen = 10))
##  confidence
##         0.1

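By setting support to 0.001 and confidence to 0.1, we get 131 rules. Most of them, 114, have a length of two, which means that both the lhs (left-hand side) and the rhs (right-hand side) consist of one product. The remaining 17 rules have a length of 3, meaning that the lhs contains two items and the rhs one.

From the summary above we also get the mean values of support, confidence and lift. Average support is 0.0029, average confidence is 0.13 and average lift is 0.95. The mean and the median lift (0.88) are both below 1, so most of these rules describe items that appear together less often than they would if they were bought independently.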

Top rules analysis

Top rules by metrics

We can analyze the rules by sorting them by confidence, support and lift. This can help identify the most relevant insights.

rules.by.conf<-sort(freq_rules, by="confidence", decreasing=TRUE) 
inspect(head(rules.by.conf))
##     lhs                      rhs          support     confidence lift    
## [1] {sausage, yogurt}     => {whole milk} 0.001470293 0.2558140  1.619866
## [2] {rolls/buns, sausage} => {whole milk} 0.001136136 0.2125000  1.345594
## [3] {sausage, soda}       => {whole milk} 0.001069304 0.1797753  1.138374
## [4] {semi-finished bread} => {whole milk} 0.001670788 0.1760563  1.114825
## [5] {rolls/buns, yogurt}  => {whole milk} 0.001336630 0.1709402  1.082428
## [6] {sausage, whole milk} => {yogurt}     0.001470293 0.1641791  1.911760
##     itemset
## [1] 565    
## [2] 567    
## [3] 566    
## [4]  11    
## [5] 586    
## [6] 565

Confidence measures how often the rhs item appears in transactions that contain the lhs items. The highest confidence is 25.58%, meaning that whole milk also appears in 25.58% of the transactions that contain both sausage and yogurt.
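
The top confidence value can be reproduced from transaction counts. The sketch below takes the rule support from the table above and the support of the lhs itemset {sausage, yogurt} from the coverage column of the whole milk rules reported further down in this paper; the variable names are illustrative.

```r
n_transactions <- 14963

rule_support <- 0.001470293   # {sausage, yogurt} => {whole milk}
lhs_support  <- 0.005747511   # coverage, i.e. support of {sausage, yogurt}

# Convert relative supports back to whole numbers of transactions
rule_count <- round(rule_support * n_transactions)
lhs_count  <- round(lhs_support * n_transactions)

# Confidence: baskets with all three items divided by baskets with the lhs
conf_rule <- rule_count / lhs_count

print(c(rule_count = rule_count, lhs_count = lhs_count, confidence = conf_rule))
```

The highest confidence in the rule set therefore rests on 22 baskets out of 86, which is worth keeping in mind when interpreting it.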

rules.by.supp<-sort(freq_rules, by="support", decreasing=TRUE) 
inspect(head(rules.by.supp))
##     lhs                   rhs          support     confidence lift      itemset
## [1] {other vegetables} => {whole milk} 0.014836597 0.1215107  0.7694305 601    
## [2] {rolls/buns}       => {whole milk} 0.013967787 0.1269745  0.8040284 599    
## [3] {soda}             => {whole milk} 0.011628684 0.1197522  0.7582957 595    
## [4] {yogurt}           => {whole milk} 0.011160863 0.1299611  0.8229402 588    
## [5] {sausage}          => {whole milk} 0.008955423 0.1483942  0.9396627 568    
## [6] {tropical fruit}   => {whole milk} 0.008220277 0.1213018  0.7681077 581

Support shows the proportion of transactions in which all items of a rule appear together. The highest support is 0.0148, meaning that the items of the rule “other vegetables => whole milk” appear together in 1.48% of all transactions.

rules.by.lift<-sort(freq_rules, by="lift", decreasing=TRUE) 
inspect(head(rules.by.lift))
##     lhs                      rhs               support     confidence lift    
## [1] {whole milk, yogurt}  => {sausage}         0.001470293 0.1317365  2.182917
## [2] {sausage, whole milk} => {yogurt}          0.001470293 0.1641791  1.911760
## [3] {sausage, yogurt}     => {whole milk}      0.001470293 0.2558140  1.619866
## [4] {flour}               => {tropical fruit}  0.001069304 0.1095890  1.617141
## [5] {processed cheese}    => {root vegetables} 0.001069304 0.1052632  1.513019
## [6] {soft cheese}         => {yogurt}          0.001269799 0.1266667  1.474952
##     itemset
## [1] 565    
## [2] 565    
## [3] 565    
## [4]  16    
## [5]  21    
## [6]  25

Lift measures how much more likely the rhs items are to be bought when the lhs items are purchased, compared with their overall purchase rate. The highest lift is 2.18: the likelihood of purchasing sausage increases 2.18 times when whole milk and yogurt are bought.
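
The highest lift can be recomputed from numbers already printed in this paper. The sketch below does it in two equivalent ways, assuming that the support of {whole milk, yogurt} equals the support of the rule {yogurt} => {whole milk} shown in the support table; the variable names are illustrative.

```r
# Values copied from the outputs above
conf_rule    <- 0.1317365     # confidence of {whole milk, yogurt} => {sausage}
rule_support <- 0.001470293   # support of the same rule
lhs_support  <- 0.011160863   # support of {whole milk, yogurt}
rhs_support  <- 0.06034886    # support of {sausage}

# Lift as confidence relative to the overall support of the rhs
lift_from_conf <- conf_rule / rhs_support

# Lift as observed joint support relative to what independence would give
lift_from_supp <- rule_support / (lhs_support * rhs_support)

print(c(lift_from_conf = lift_from_conf, lift_from_supp = lift_from_supp))
```

Both routes give roughly 2.18, matching the value reported by inspect().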

Visualisations

We can also plot the rules on a scatter plot that includes all three measures: support and confidence are plotted on the x and y axes, and lift is shown by the shading of the dots.

plot(freq_rules, measure=c("support", "confidence"), shading="lift")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

The majority of rules have support below 0.005 and confidence below 0.15. The rules with the highest confidence have very low support, as do the rules with the highest lift.

To find the rules whose items appear together in baskets most often, we sort the rules by confidence first and then by support, so that the final ordering is by support.

inspect(head(sort(sort(freq_rules, by ="confidence"),by="support"),10))
##      lhs                   rhs          support     confidence lift     
## [1]  {other vegetables} => {whole milk} 0.014836597 0.1215107  0.7694305
## [2]  {rolls/buns}       => {whole milk} 0.013967787 0.1269745  0.8040284
## [3]  {soda}             => {whole milk} 0.011628684 0.1197522  0.7582957
## [4]  {yogurt}           => {whole milk} 0.011160863 0.1299611  0.8229402
## [5]  {sausage}          => {whole milk} 0.008955423 0.1483942  0.9396627
## [6]  {tropical fruit}   => {whole milk} 0.008220277 0.1213018  0.7681077
## [7]  {root vegetables}  => {whole milk} 0.007551962 0.1085495  0.6873575
## [8]  {bottled beer}     => {whole milk} 0.007150972 0.1578171  0.9993303
## [9]  {citrus fruit}     => {whole milk} 0.007150972 0.1345912  0.8522590
## [10] {bottled water}    => {whole milk} 0.007150972 0.1178414  0.7461959
##      itemset
## [1]  601    
## [2]  599    
## [3]  595    
## [4]  588    
## [5]  568    
## [6]  581    
## [7]  575    
## [8]  488    
## [9]  548    
## [10] 557

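In all of these rules whole milk is on the rhs, as it is the most frequently appearing item. All ten lifts are below 1, however, so these rules reflect the popularity of whole milk rather than a positive association. To visualize rules we can use a matrix plot, which places the lhs on the x axis, the rhs on the y axis and shows lift by shading. The plot below uses the ten rules with the highest confidence, ordered by support.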

plot_rules <- freq_rules %>%
  sort(by = "confidence") %>%
  head(10) %>%
  sort(by = "support")
plot(plot_rules, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
##  [1] "{sausage,whole milk}"  "{sausage,yogurt}"      "{rolls/buns,sausage}" 
##  [4] "{sausage,soda}"        "{semi-finished bread}" "{rolls/buns,yogurt}"  
##  [7] "{detergent}"           "{ham}"                 "{bottled beer}"       
## [10] "{frozen fish}"        
## Itemsets in Consequent (RHS)
## [1] "{whole milk}" "{yogurt}"

If we want to analyze and visualize relationships between items further, a parallel coordinates plot is a good choice. It shows which item a customer tends to add, based on what they already have in their basket.

plot(plot_rules, method="paracoord")

In this case, a person who buys whole milk and sausage is more likely than average to also buy yogurt (lift of 1.91).

Individual rules

To deepen the analysis, it’s possible to check what motivates people to buy certain items. To showcase this, we will analyse whole milk, as the most frequently bought item.

Whole milk - rules

whole_milk <-apriori(data = transactions, parameter = list(support=0.001, confidence=0.1), appearance = list(default="lhs", rhs="whole milk"), control=list(verbose=F))
inspect(sort(whole_milk, by="lift"))
##      lhs                               rhs          support     confidence
## [1]  {sausage, yogurt}              => {whole milk} 0.001470293 0.2558140 
## [2]  {rolls/buns, sausage}          => {whole milk} 0.001136136 0.2125000 
## [3]  {sausage, soda}                => {whole milk} 0.001069304 0.1797753 
## [4]  {semi-finished bread}          => {whole milk} 0.001670788 0.1760563 
## [5]  {rolls/buns, yogurt}           => {whole milk} 0.001336630 0.1709402 
## [6]  {detergent}                    => {whole milk} 0.001403462 0.1627907 
## [7]  {ham}                          => {whole milk} 0.002740092 0.1601562 
## [8]  {}                             => {whole milk} 0.157922876 0.1579229 
## [9]  {bottled beer}                 => {whole milk} 0.007150972 0.1578171 
## [10] {frozen fish}                  => {whole milk} 0.001069304 0.1568627 
## [11] {candy}                        => {whole milk} 0.002138609 0.1488372 
## [12] {sausage}                      => {whole milk} 0.008955423 0.1483942 
## [13] {onions}                       => {whole milk} 0.002940587 0.1452145 
## [14] {processed cheese}             => {whole milk} 0.001470293 0.1447368 
## [15] {newspapers}                   => {whole milk} 0.005613847 0.1443299 
## [16] {domestic eggs}                => {whole milk} 0.005279690 0.1423423 
## [17] {cat food}                     => {whole milk} 0.001670788 0.1412429 
## [18] {waffles}                      => {whole milk} 0.002606429 0.1407942 
## [19] {hamburger meat}               => {whole milk} 0.003074250 0.1406728 
## [20] {other vegetables, yogurt}     => {whole milk} 0.001136136 0.1404959 
## [21] {frankfurter}                  => {whole milk} 0.005279690 0.1398230 
## [22] {sugar}                        => {whole milk} 0.002472766 0.1396226 
## [23] {chewing gum}                  => {whole milk} 0.001670788 0.1388889 
## [24] {beef}                         => {whole milk} 0.004678206 0.1377953 
## [25] {flour}                        => {whole milk} 0.001336630 0.1369863 
## [26] {frozen vegetables}            => {whole milk} 0.003809397 0.1360382 
## [27] {pork}                         => {whole milk} 0.005012364 0.1351351 
## [28] {pip fruit}                    => {whole milk} 0.006616320 0.1348774 
## [29] {citrus fruit}                 => {whole milk} 0.007150972 0.1345912 
## [30] {long life bakery product}     => {whole milk} 0.002405935 0.1343284 
## [31] {grapes}                       => {whole milk} 0.001938114 0.1342593 
## [32] {shopping bags}                => {whole milk} 0.006348994 0.1334270 
## [33] {butter}                       => {whole milk} 0.004678206 0.1328273 
## [34] {pasta}                        => {whole milk} 0.001069304 0.1322314 
## [35] {meat}                         => {whole milk} 0.002205440 0.1309524 
## [36] {white bread}                  => {whole milk} 0.003141081 0.1309192 
## [37] {oil}                          => {whole milk} 0.001938114 0.1300448 
## [38] {yogurt}                       => {whole milk} 0.011160863 0.1299611 
## [39] {fruit/vegetable juice}        => {whole milk} 0.004410880 0.1296660 
## [40] {pot plants}                   => {whole milk} 0.001002473 0.1282051 
## [41] {canned beer}                  => {whole milk} 0.006014837 0.1282051 
## [42] {ice cream}                    => {whole milk} 0.001938114 0.1277533 
## [43] {hard cheese}                  => {whole milk} 0.001871282 0.1272727 
## [44] {rolls/buns}                   => {whole milk} 0.013967787 0.1269745 
## [45] {hygiene articles}             => {whole milk} 0.001737619 0.1268293 
## [46] {margarine}                    => {whole milk} 0.004076723 0.1265560 
## [47] {pastry}                       => {whole milk} 0.006482657 0.1253230 
## [48] {chocolate}                    => {whole milk} 0.002940587 0.1246459 
## [49] {rolls/buns, soda}             => {whole milk} 0.001002473 0.1239669 
## [50] {curd}                         => {whole milk} 0.004143554 0.1230159 
## [51] {chicken}                      => {whole milk} 0.003408407 0.1223022 
## [52] {other vegetables}             => {whole milk} 0.014836597 0.1215107 
## [53] {cream cheese }                => {whole milk} 0.002873755 0.1214689 
## [54] {tropical fruit}               => {whole milk} 0.008220277 0.1213018 
## [55] {coffee}                       => {whole milk} 0.003809397 0.1205074 
## [56] {soft cheese}                  => {whole milk} 0.001202967 0.1200000 
## [57] {soda}                         => {whole milk} 0.011628684 0.1197522 
## [58] {specialty bar}                => {whole milk} 0.001670788 0.1196172 
## [59] {brown bread}                  => {whole milk} 0.004477712 0.1190053 
## [60] {UHT-milk}                     => {whole milk} 0.002539598 0.1187500 
## [61] {bottled water}                => {whole milk} 0.007150972 0.1178414 
## [62] {other vegetables, soda}       => {whole milk} 0.001136136 0.1172414 
## [63] {beverages}                    => {whole milk} 0.001938114 0.1169355 
## [64] {frozen meals}                 => {whole milk} 0.001938114 0.1155378 
## [65] {other vegetables, rolls/buns} => {whole milk} 0.001202967 0.1139241 
## [66] {pickled vegetables}           => {whole milk} 0.001002473 0.1119403 
## [67] {napkins}                      => {whole milk} 0.002405935 0.1087613 
## [68] {white wine}                   => {whole milk} 0.001269799 0.1085714 
## [69] {root vegetables}              => {whole milk} 0.007551962 0.1085495 
## [70] {herbs}                        => {whole milk} 0.001136136 0.1075949 
## [71] {whipped/sour cream}           => {whole milk} 0.004611375 0.1055046 
## [72] {sliced cheese}                => {whole milk} 0.001470293 0.1047619 
## [73] {berries}                      => {whole milk} 0.002272272 0.1042945 
## [74] {salty snack}                  => {whole milk} 0.001938114 0.1032028 
## [75] {dessert}                      => {whole milk} 0.002405935 0.1019830 
##      coverage    lift      count
## [1]  0.005747511 1.6198664   22 
## [2]  0.005346521 1.3455935   17 
## [3]  0.005948005 1.1383739   16 
## [4]  0.009490076 1.1148248   25 
## [5]  0.007819288 1.0824282   20 
## [6]  0.008621266 1.0308240   21 
## [7]  0.017108869 1.0141422   41 
## [8]  1.000000000 1.0000000 2363 
## [9]  0.045311769 0.9993303  107 
## [10] 0.006816815 0.9932870   16 
## [11] 0.014368776 0.9424677   32 
## [12] 0.060348861 0.9396627  134 
## [13] 0.020249950 0.9195281   44 
## [14] 0.010158391 0.9165033   22 
## [15] 0.038895943 0.9139265   84 
## [16] 0.037091492 0.9013409   79 
## [17] 0.011829179 0.8943792   25 
## [18] 0.018512330 0.8915379   39 
## [19] 0.021853906 0.8907689   46 
## [20] 0.008086614 0.8896486   17 
## [21] 0.037759808 0.8853879   79 
## [22] 0.017710352 0.8841192   37 
## [23] 0.012029673 0.8794729   25 
## [24] 0.033950411 0.8725479   70 
## [25] 0.009757402 0.8674253   20 
## [26] 0.028002406 0.8614217   57 
## [27] 0.037091492 0.8557034   75 
## [28] 0.049054334 0.8540712   99 
## [29] 0.053131057 0.8522590  107 
## [30] 0.017910847 0.8505947   36 
## [31] 0.014435608 0.8501571   29 
## [32] 0.047584041 0.8448869   95 
## [33] 0.035220210 0.8410898   70 
## [34] 0.008086614 0.8373163   16 
## [35] 0.016841542 0.8292173   33 
## [36] 0.023992515 0.8290073   47 
## [37] 0.014903428 0.8234706   29 
## [38] 0.085878500 0.8229402  167 
## [39] 0.034017243 0.8210717   66 
## [40] 0.007819288 0.8118211   15 
## [41] 0.046915725 0.8118211   90 
## [42] 0.015170755 0.8089601   29 
## [43] 0.014702934 0.8059170   28 
## [44] 0.110004678 0.8040284  209 
## [45] 0.013700461 0.8031089   26 
## [46] 0.032212792 0.8013786   61 
## [47] 0.051727595 0.7935709   97 
## [48] 0.023591526 0.7892833   44 
## [49] 0.008086614 0.7849841   15 
## [50] 0.033683085 0.7789617   62 
## [51] 0.027868743 0.7744423   51 
## [52] 0.122101183 0.7694305  222 
## [53] 0.023658357 0.7691661   43 
## [54] 0.067767159 0.7681077  123 
## [55] 0.031611308 0.7630775   57 
## [56] 0.010024728 0.7598646   18 
## [57] 0.097106195 0.7582957  174 
## [58] 0.013967787 0.7574408   25 
## [59] 0.037626144 0.7535661   67 
## [60] 0.021386086 0.7519493   38 
## [61] 0.060683018 0.7461959  107 
## [62] 0.009690570 0.7423964   17 
## [63] 0.016574216 0.7404594   29 
## [64] 0.016774711 0.7316093   29 
## [65] 0.010559380 0.7213904   18 
## [66] 0.008955423 0.7088289   15 
## [67] 0.022121232 0.6886990   36 
## [68] 0.011695516 0.6874965   19 
## [69] 0.069571610 0.6873575  113 
## [70] 0.010559380 0.6813132   17 
## [71] 0.043707813 0.6680767   69 
## [72] 0.014034619 0.6633738   22 
## [73] 0.021787075 0.6604140   34 
## [74] 0.018779656 0.6535016   29 
## [75] 0.023591526 0.6457773   36
is.significant(whole_milk, transactions)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE

The function above tests whether the rules are statistically significant using Fisher’s exact test. The output shows that rules predicting “whole milk” exist, but none of them is statistically significant, so the associations may be weak or due to chance. This is consistent with the lift values: most of these rules have lift below 1. One option would be to raise the support and confidence thresholds, but then fewer rules would be found. Instead, we can try to find significant rules for other products.

Sausage - rules

Instead of choosing the most frequently bought item, let’s focus on the item on the right-hand side of the rule with the highest lift, which is sausage.

sausage <-apriori(data = transactions, parameter = list(support=0.001, confidence=0.1), appearance = list(default="lhs", rhs="sausage"), control=list(verbose=F))
inspect(sort(sausage, by="lift"))
##     lhs                     rhs       support     confidence coverage  
## [1] {whole milk, yogurt} => {sausage} 0.001470293 0.1317365  0.01116086
##     lift     count
## [1] 2.182917 22
is.significant(sausage, transactions)
## [1] TRUE

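The rule found for sausage is statistically significant and shows a meaningful relationship between the items. When whole milk and yogurt are purchased together, the customer is about 2.18 times more likely than average to buy sausage as well, although this still happens in only 13.2% of such baskets.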
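
To make the significance check less of a black box, the sketch below runs a one-sided Fisher's exact test in base R on the 2x2 table behind this rule. The counts come from outputs shown earlier (22 baskets with all three items, 167 with whole milk and yogurt, 903 with sausage, 14963 in total). This is an approximation of the idea only: the exact settings used by is.significant(), such as the significance level or any correction for multiple testing, are not reproduced here.

```r
n    <- 14963   # all transactions
lhs  <- 167     # baskets with whole milk and yogurt
rhs  <- 903     # baskets with sausage
both <- 22      # baskets with whole milk, yogurt and sausage

# 2x2 contingency table: rows = lhs present or not, columns = rhs present or not
tab <- matrix(
  c(both,       lhs - both,
    rhs - both, n - lhs - rhs + both),
  nrow = 2, byrow = TRUE,
  dimnames = list(lhs = c("yes", "no"), rhs = c("yes", "no"))
)
print(tab)

# Co-occurrences expected if lhs and rhs were independent
expected_both <- lhs * rhs / n

# One-sided test: do lhs and rhs co-occur more often than expected?
p_value <- fisher.test(tab, alternative = "greater")$p.value

print(c(expected = expected_both, observed = both, p_value = p_value))
```

The observed 22 co-occurrences are roughly twice the number expected under independence, which is the same information that the lift of 2.18 conveys.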

Conclusions

Market Basket Analysis can be a useful tool, providing valuable insight into customer behaviour. In this paper, a dataset of grocery transactions was analyzed. Using association rule mining, frequently bought items were identified and the relationships between them explored. Support, confidence and lift were essential to that analysis.

Whole milk is the most frequently bought item, but the rules predicting a milk purchase were not statistically significant. This may be because it is a staple item, bought often and independently of other products. Sausage, yogurt and whole milk exhibited the strongest association: if a customer bought whole milk and yogurt, they were more likely to also buy sausage.

The majority of rules involved small, two-product itemsets. This suggests that associations show up mainly between pairs of products rather than larger sets of items. A few niche products were bought only a handful of times, including kitchen utensils and baby cosmetics.

The aim of this paper was to showcase market basket analysis using association rule mining. Insights like these can be used in several ways. Businesses might use such an analysis to create personalized promotions or discounted product bundles based on strong associations between items. Market basket analysis is a powerful method that helps to understand customer behaviour.