Introduction

The aim of this study is to use association rules to identify patterns and dependencies in consumers' market baskets at a grocery store. The data comes from Kaggle; the file used is Groceries_dataset.csv (Data Sources -> Groceries dataset). Because the functions used expect one basket per row, the purchases were first grouped by transaction date and customer and written to a text file in which each row corresponds to a single market basket. Python was used for this data transformation.
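The original Python transformation is not shown in this report. As a rough illustration of the grouping step only, the sketch below does the equivalent in base R on a few made-up rows. The column names (Member_number, Date, itemDescription) and the removal of repeated items within a basket are assumptions, not taken from the original script.

```r
# Made-up rows standing in for Groceries_dataset.csv.
# Column names are assumed; adjust them to the actual file.
purchases <- data.frame(
  Member_number   = c(1001, 1001, 1002, 1002, 1003),
  Date            = c("21-07-2015", "21-07-2015", "05-01-2015", "05-01-2015", "19-09-2015"),
  itemDescription = c("tropical fruit", "whole milk", "whole milk", "pip fruit", "pip fruit"),
  stringsAsFactors = FALSE
)

# One basket per (customer, date) pair.
basket_id <- paste(purchases$Member_number, purchases$Date, sep = "_")
baskets   <- split(purchases$itemDescription, basket_id)

# Collapse each basket into one comma-separated line.
# unique() drops repeated items within a basket (an assumption of this sketch).
basket_lines <- vapply(
  baskets,
  function(items) paste(unique(items), collapse = ","),
  character(1)
)

# To produce the input file for read.transactions(), one could then run:
# writeLines(basket_lines, "transactions.txt")
print(unname(basket_lines))
```

Each element of basket_lines corresponds to one row of the text file, which matches the format = "basket", sep = "," arguments used when the file is read.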

Import necessary packages

library(arules)
library(arulesViz)

Dataset preparation

First, let’s load the data and look at the summary statistics:

transactions<-read.transactions("transactions.txt", format="basket",
                                sep=",", skip=0, quote="", rm.duplicates = FALSE)
summary(transactions)
## transactions as itemMatrix in sparse format with
##  14963 rows (elements/itemsets/transactions) and
##  167 columns (items) and a density of 0.01520957 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2363             1827             1646             1453 
##           yogurt          (Other) 
##             1285            29432 
## 
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10 
##   205 10012  2727  1273   338   179   113    96    19     1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    2.00    2.54    3.00   10.00 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
inspect(transactions[1:10])
##      items                                             
## [1]  {sausage, semi-finished bread, whole milk, yogurt}
## [2]  {pastry, salty snack, whole milk}                 
## [3]  {canned beer, misc. beverages}                    
## [4]  {hygiene articles, sausage}                       
## [5]  {pickled vegetables, soda}                        
## [6]  {curd, frankfurter}                               
## [7]  {rolls/buns, sausage, whole milk}                 
## [8]  {soda, whole milk}                                
## [9]  {beef, white bread}                               
## [10] {frankfurter, soda, whipped/sour cream}

As seen above, the data contains 14963 transactions and 167 different products. The most frequently purchased item is whole milk (2363 transactions), and the most common basket size is two products (10012 transactions). The first ten baskets are shown as a sample.
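The figures in the summary can be cross-checked against each other. The short base R sketch below uses only the basket size distribution and the item count printed above; the variable names are chosen for illustration.

```r
# Basket size distribution copied from the summary() output.
sizes     <- 1:10
n_baskets <- c(205, 10012, 2727, 1273, 338, 179, 113, 96, 19, 1)
n_items   <- 167  # number of distinct products (columns)

total_baskets      <- sum(n_baskets)          # number of transactions
total_item_entries <- sum(sizes * n_baskets)  # filled cells in the sparse matrix

# Mean basket size and matrix density (share of filled cells).
mean_size <- total_item_entries / total_baskets
density   <- total_item_entries / (total_baskets * n_items)

# Share of baskets that contain exactly two products.
share_two <- n_baskets[2] / total_baskets

print(round(c(mean_size = mean_size, density = density, share_two = share_two), 4))
```

The results agree with the reported mean of 2.54 and density of 0.0152, and show that roughly two thirds of all baskets contain exactly two products.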

Below are charts presenting item frequency for the 20 most popular products, in relative and absolute terms:

itemFrequencyPlot(transactions, type = "relative", topN = 20, col = "skyblue", main = "Item Frequency - Relative")

itemFrequencyPlot(transactions, type = "absolute", topN = 20, col = "lightgreen", main = "Item Frequency - Absolute")

Association Rules - Apriori Algorithm

Now let’s move on to the Apriori algorithm. The minimum support was set to 0.002, the minimum confidence to 0.1, and the minimum rule length to 2. These thresholds were chosen because, at higher support or confidence levels, the algorithm found either no rules or too few to interpret meaningfully. The result is 61 rules, all of which are two-element rules (one item on each side). The algorithm results are as follows:

transactionsrules <- apriori(transactions, parameter = list(support = 0.002, confidence = 0.1, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5   0.002      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 29 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 14963 transaction(s)] done [0.00s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [61 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(transactionsrules)
##      lhs                           rhs                support     confidence
## [1]  {candy}                    => {whole milk}       0.002138609 0.1488372 
## [2]  {meat}                     => {other vegetables} 0.002138609 0.1269841 
## [3]  {meat}                     => {whole milk}       0.002205440 0.1309524 
## [4]  {ham}                      => {whole milk}       0.002740092 0.1601562 
## [5]  {frozen meals}             => {other vegetables} 0.002138609 0.1274900 
## [6]  {sugar}                    => {whole milk}       0.002472766 0.1396226 
## [7]  {long life bakery product} => {whole milk}       0.002405935 0.1343284 
## [8]  {waffles}                  => {whole milk}       0.002606429 0.1407942 
## [9]  {salty snack}              => {other vegetables} 0.002205440 0.1174377 
## [10] {onions}                   => {whole milk}       0.002940587 0.1452145 
## [11] {UHT-milk}                 => {other vegetables} 0.002138609 0.1000000 
## [12] {UHT-milk}                 => {whole milk}       0.002539598 0.1187500 
## [13] {berries}                  => {other vegetables} 0.002673261 0.1226994 
## [14] {berries}                  => {whole milk}       0.002272272 0.1042945 
## [15] {hamburger meat}           => {other vegetables} 0.002205440 0.1009174 
## [16] {hamburger meat}           => {whole milk}       0.003074250 0.1406728 
## [17] {dessert}                  => {whole milk}       0.002405935 0.1019830 
## [18] {napkins}                  => {whole milk}       0.002405935 0.1087613 
## [19] {cream cheese}             => {whole milk}       0.002873755 0.1214689 
## [20] {chocolate}                => {rolls/buns}       0.002806924 0.1189802 
## [21] {chocolate}                => {whole milk}       0.002940587 0.1246459 
## [22] {white bread}              => {other vegetables} 0.002606429 0.1086351 
## [23] {white bread}              => {whole milk}       0.003141081 0.1309192 
## [24] {chicken}                  => {rolls/buns}       0.002873755 0.1031175 
## [25] {chicken}                  => {whole milk}       0.003408407 0.1223022 
## [26] {frozen vegetables}        => {other vegetables} 0.003141081 0.1121718 
## [27] {frozen vegetables}        => {whole milk}       0.003809397 0.1360382 
## [28] {coffee}                   => {whole milk}       0.003809397 0.1205074 
## [29] {margarine}                => {whole milk}       0.004076723 0.1265560 
## [30] {beef}                     => {whole milk}       0.004678206 0.1377953 
## [31] {fruit/vegetable juice}    => {rolls/buns}       0.003742565 0.1100196 
## [32] {fruit/vegetable juice}    => {whole milk}       0.004410880 0.1296660 
## [33] {curd}                     => {other vegetables} 0.003542070 0.1051587 
## [34] {curd}                     => {whole milk}       0.004143554 0.1230159 
## [35] {butter}                   => {whole milk}       0.004678206 0.1328273 
## [36] {pork}                     => {other vegetables} 0.003943060 0.1063063 
## [37] {pork}                     => {whole milk}       0.005012364 0.1351351 
## [38] {domestic eggs}            => {whole milk}       0.005279690 0.1423423 
## [39] {brown bread}              => {whole milk}       0.004477712 0.1190053 
## [40] {newspapers}               => {whole milk}       0.005613847 0.1443299 
## [41] {frankfurter}              => {other vegetables} 0.005146027 0.1362832 
## [42] {frankfurter}              => {whole milk}       0.005279690 0.1398230 
## [43] {whipped/sour cream}       => {whole milk}       0.004611375 0.1055046 
## [44] {bottled beer}             => {other vegetables} 0.004678206 0.1032448 
## [45] {bottled beer}             => {whole milk}       0.007150972 0.1578171 
## [46] {canned beer}              => {whole milk}       0.006014837 0.1282051 
## [47] {shopping bags}            => {other vegetables} 0.004945532 0.1039326 
## [48] {shopping bags}            => {whole milk}       0.006348994 0.1334270 
## [49] {pip fruit}                => {rolls/buns}       0.004945532 0.1008174 
## [50] {pip fruit}                => {other vegetables} 0.004945532 0.1008174 
## [51] {pip fruit}                => {whole milk}       0.006616320 0.1348774 
## [52] {pastry}                   => {whole milk}       0.006482657 0.1253230 
## [53] {citrus fruit}             => {whole milk}       0.007150972 0.1345912 
## [54] {bottled water}            => {whole milk}       0.007150972 0.1178414 
## [55] {sausage}                  => {whole milk}       0.008955423 0.1483942 
## [56] {root vegetables}          => {whole milk}       0.007551962 0.1085495 
## [57] {tropical fruit}           => {whole milk}       0.008220277 0.1213018 
## [58] {yogurt}                   => {whole milk}       0.011160863 0.1299611 
## [59] {soda}                     => {whole milk}       0.011628684 0.1197522 
## [60] {rolls/buns}               => {whole milk}       0.013967787 0.1269745 
## [61] {other vegetables}         => {whole milk}       0.014836597 0.1215107 
##      coverage   lift      count
## [1]  0.01436878 0.9424677  32  
## [2]  0.01684154 1.0399910  32  
## [3]  0.01684154 0.8292173  33  
## [4]  0.01710887 1.0141422  41  
## [5]  0.01677471 1.0441344  32  
## [6]  0.01771035 0.8841192  37  
## [7]  0.01791085 0.8505947  36  
## [8]  0.01851233 0.8915379  39  
## [9]  0.01877966 0.9618066  33  
## [10] 0.02024995 0.9195281  44  
## [11] 0.02138609 0.8189929  32  
## [12] 0.02138609 0.7519493  38  
## [13] 0.02178707 1.0048992  40  
## [14] 0.02178707 0.6604140  34  
## [15] 0.02185391 0.8265066  33  
## [16] 0.02185391 0.8907689  46  
## [17] 0.02359153 0.6457773  36  
## [18] 0.02212123 0.6886990  36  
## [19] 0.02365836 0.7691661  43  
## [20] 0.02359153 1.0815919  42  
## [21] 0.02359153 0.7892833  44  
## [22] 0.02399251 0.8897137  39  
## [23] 0.02399251 0.8290073  47  
## [24] 0.02786874 0.9373920  43  
## [25] 0.02786874 0.7744423  51  
## [26] 0.02800241 0.9186794  47  
## [27] 0.02800241 0.8614217  57  
## [28] 0.03161131 0.7630775  57  
## [29] 0.03221279 0.8013786  61  
## [30] 0.03395041 0.8725479  70  
## [31] 0.03401724 1.0001361  56  
## [32] 0.03401724 0.8210717  66  
## [33] 0.03368308 0.8612425  53  
## [34] 0.03368308 0.7789617  62  
## [35] 0.03522021 0.8410898  70  
## [36] 0.03709149 0.8706411  59  
## [37] 0.03709149 0.8557034  75  
## [38] 0.03709149 0.9013409  79  
## [39] 0.03762614 0.7535661  67  
## [40] 0.03889594 0.9139265  84  
## [41] 0.03775981 1.1161496  77  
## [42] 0.03775981 0.8853879  79  
## [43] 0.04370781 0.6680767  69  
## [44] 0.04531177 0.8455679  70  
## [45] 0.04531177 0.9993303 107  
## [46] 0.04691573 0.8118211  90  
## [47] 0.04758404 0.8512005  74  
## [48] 0.04758404 0.8448869  95  
## [49] 0.04905433 0.9164832  74  
## [50] 0.04905433 0.8256876  74  
## [51] 0.04905433 0.8540712  99  
## [52] 0.05172759 0.7935709  97  
## [53] 0.05313106 0.8522590 107  
## [54] 0.06068302 0.7461959 107  
## [55] 0.06034886 0.9396627 134  
## [56] 0.06957161 0.6873575 113  
## [57] 0.06776716 0.7681077 123  
## [58] 0.08587850 0.8229402 167  
## [59] 0.09710620 0.7582957 174  
## [60] 0.11000468 0.8040284 209  
## [61] 0.12210118 0.7694305 222
summary(transactionsrules)
## set of 61 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2 
## 61 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support           confidence        coverage            lift       
##  Min.   :0.002139   Min.   :0.1000   Min.   :0.01437   Min.   :0.6458  
##  1st Qu.:0.002673   1st Qu.:0.1100   1st Qu.:0.02185   1st Qu.:0.7893  
##  Median :0.003943   Median :0.1246   Median :0.03368   Median :0.8506  
##  Mean   :0.004697   Mean   :0.1243   Mean   :0.03795   Mean   :0.8543  
##  3rd Qu.:0.005280   3rd Qu.:0.1349   3rd Qu.:0.04692   3rd Qu.:0.9139  
##  Max.   :0.014837   Max.   :0.1602   Max.   :0.12210   Max.   :1.1161  
##      count       
##  Min.   : 32.00  
##  1st Qu.: 40.00  
##  Median : 59.00  
##  Mean   : 70.28  
##  3rd Qu.: 79.00  
##  Max.   :222.00  
## 
## mining info:
##          data ntransactions support confidence
##  transactions         14963   0.002        0.1
##                                                                                           call
##  apriori(data = transactions, parameter = list(support = 0.002, confidence = 0.1, minlen = 2))

Results/Rules

To interpret the graph below, it is important to understand three concepts:

Support

It is the share of all transactions that contain a given set of items/products.

Confidence

It is the conditional probability that a transaction containing product X (lhs) also contains product Y (rhs), i.e. the support of X and Y together divided by the support of X.

Lift

It is the ratio of the observed support of X and Y together to the support expected if X and Y were bought independently; equivalently, it is the confidence of the rule divided by the support of Y. A lift value of 1 is the neutral point, indicating that buying X does not change the likelihood of buying Y. Values above 1 mean the products appear together more often than expected, and values below 1 mean they appear together less often than expected.

plot(transactionsrules, method="graph", measure="support", shading="lift", engine="html")

Example of interpreting a rule from the plot:

Rule 60: {rolls/buns} => {whole milk}
Support = 0.014 - About 1.4% of all transactions contain both rolls/buns and whole milk.
Confidence = 0.127 - Of the transactions that contain rolls/buns, 12.7% also contain whole milk.
Lift = 0.804 - Rolls/buns and whole milk appear together about 20% less often than would be expected if they were bought independently. Buying rolls/buns is therefore associated with a somewhat lower, not higher, chance of buying whole milk in the same transaction.
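The measures for rule 60 can be reproduced by hand from counts already printed in this report: the number of transactions, the item frequencies of rolls/buns and whole milk, and the rule count. The sketch below is a plain arithmetic check in base R, taking lift as confidence divided by the overall share of transactions containing the rhs; the variable names are illustrative.

```r
# Counts taken from the summary() and inspect() outputs.
n_transactions <- 14963  # all baskets
n_rolls        <- 1646   # baskets containing rolls/buns (lhs)
n_milk         <- 2363   # baskets containing whole milk (rhs)
n_both         <- 209    # baskets containing both (rule count)

support    <- n_both / n_transactions              # share of baskets with both items
confidence <- n_both / n_rolls                     # share of lhs baskets that also hold rhs
coverage   <- n_rolls / n_transactions             # share of baskets with the lhs
lift       <- confidence / (n_milk / n_transactions)

print(round(c(support = support, confidence = confidence,
              coverage = coverage, lift = lift), 4))
```

The values match the support, confidence, coverage and lift reported for rule 60 in the inspect() output.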

Conclusion

In this paper, the Apriori algorithm was applied to market basket data from a grocery store. The analysis yielded 61 rules, with whole milk, other vegetables, and rolls/buns dominating the right-hand side (rhs). Most of these rules have a lift below 1 (median 0.85, maximum 1.12), so they largely reflect how popular these products are overall rather than strong positive associations. Adjusting the algorithm parameters in the code, such as support and confidence, may reveal more specific and interesting rules tailored to the user’s needs.
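One simple way to look for the more interesting rules is to keep only those with lift above 1 and rank them. The sketch below does this in base R on a small excerpt of the rules listed earlier (rules 41, 20, 5, 2, 58, 60 and 61), copied into a data frame for illustration; it is not the full rule set.

```r
# Excerpt of the inspect() output; values copied from the report.
rules <- data.frame(
  lhs   = c("frankfurter", "chocolate", "frozen meals", "meat",
            "yogurt", "rolls/buns", "other vegetables"),
  rhs   = c("other vegetables", "rolls/buns", "other vegetables", "other vegetables",
            "whole milk", "whole milk", "whole milk"),
  lift  = c(1.1161496, 1.0815919, 1.0441344, 1.0399910,
            0.8229402, 0.8040284, 0.7694305),
  count = c(77, 42, 32, 32, 167, 209, 222),
  stringsAsFactors = FALSE
)

# Keep rules whose items co-occur more often than expected, strongest first.
positive <- rules[rules$lift > 1, ]
positive <- positive[order(-positive$lift), ]

print(positive)
```

On the rule object itself, the same selection should be possible with the arules helpers subset(transactionsrules, lift > 1) and sort(transactionsrules, by = "lift"). Note that the rules with the highest lift in this excerpt also have low counts, so they rest on relatively few transactions.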