The aim of this study is to use association rules to identify patterns and dependencies in the market baskets of grocery store customers. The data comes from Kaggle; the file used is located under Data Sources -> Groceries dataset -> Groceries_dataset.csv. Because read.transactions() in basket format expects one basket per line, the purchase records were first grouped by transaction date and customer and written to a text file in which each row corresponds to a single market basket. Python was used for this data transformation.
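The grouping step can be sketched as follows. The original transformation was done in Python and is not shown, so this R version is only an illustration. The column names (Member_number, Date, itemDescription), the toy rows, and the removal of repeated items within a basket are all assumptions.

```r
# Toy stand-in for Groceries_dataset.csv (column names are assumed).
purchases <- data.frame(
  Member_number   = c(1808, 1808, 2552, 2552, 2552),
  Date            = c("21-07-2015", "21-07-2015", "05-01-2015", "05-01-2015", "12-03-2015"),
  itemDescription = c("tropical fruit", "whole milk", "whole milk", "pastry", "soda"),
  stringsAsFactors = FALSE
)

# One basket per (customer, date) pair; drop = TRUE discards empty combinations.
baskets <- split(purchases$itemDescription,
                 list(purchases$Member_number, purchases$Date),
                 drop = TRUE)

# Collapse each basket into one comma-separated line, as format = "basket" expects.
# unique() is an assumed clean-up step for items listed twice in the same basket.
basket_lines <- vapply(baskets,
                       function(items) paste(unique(items), collapse = ","),
                       character(1))

# Written to a temporary directory here; the study's file is named transactions.txt.
out_file <- file.path(tempdir(), "transactions.txt")
writeLines(basket_lines, out_file)
print(unname(basket_lines))
```

Each element of basket_lines becomes one row of the text file, which is the layout read.transactions() reads with format = "basket" and sep = ",".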
library(arules)
library(arulesViz)
Firstly, let’s load the data and take a look at the statistics:
transactions<-read.transactions("transactions.txt", format="basket",
sep=",", skip=0, quote="", rm.duplicates = FALSE)
summary(transactions)
## transactions as itemMatrix in sparse format with
## 14963 rows (elements/itemsets/transactions) and
## 167 columns (items) and a density of 0.01520957
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2363             1827             1646             1453
##           yogurt          (Other)
##             1285            29432
##
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10
##   205 10012  2727  1273   338   179   113    96    19     1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    2.00    2.00    2.54    3.00   10.00
##
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
inspect(transactions[1:10])
## items
## [1] {sausage, semi-finished bread, whole milk, yogurt}
## [2] {pastry, salty snack, whole milk}
## [3] {canned beer, misc. beverages}
## [4] {hygiene articles, sausage}
## [5] {pickled vegetables, soda}
## [6] {curd, frankfurter}
## [7] {rolls/buns, sausage, whole milk}
## [8] {soda, whole milk}
## [9] {beef, white bread}
## [10] {frankfurter, soda, whipped/sour cream}
As seen above, the data contains 14963 transactions and 167 different products. The most frequently purchased item is whole milk, which appears in 2363 baskets, and the most common basket size is two products (10012 transactions). Ten sample baskets are also shown.
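The headline figures in the summary can be cross-checked from the basket-size distribution alone. The sketch below uses only numbers copied from the summary() output; the variable names are illustrative.

```r
# Basket-size distribution copied from the summary() output.
sizes  <- 1:10
counts <- c(205, 10012, 2727, 1273, 338, 179, 113, 96, 19, 1)

n_transactions <- sum(counts)                 # number of baskets
n_items_bought <- sum(sizes * counts)         # total item occurrences across all baskets
mean_size      <- n_items_bought / n_transactions          # average basket size
density        <- n_items_bought / (n_transactions * 167)  # filled share of the 14963 x 167 matrix

# Share of baskets containing whole milk (2363 baskets).
milk_share <- 2363 / n_transactions

print(round(c(mean_size = mean_size, density = density, milk_share = milk_share), 4))
```

The mean basket size and the density agree with the reported 2.54 and 0.0152, and whole milk turns out to be present in roughly 16% of baskets.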
Below are charts presenting item frequency for the 20 most popular products, in relative and absolute terms:
itemFrequencyPlot(transactions, type = "relative", topN = 20, col = "skyblue", main = "Item Frequency - Relative")
itemFrequencyPlot(transactions, type = "absolute", topN = 20, col = "lightgreen", main = "Item Frequency - Absolute")
Now let’s move on to the Apriori algorithm. The minimum support was set to 0.002, the minimum confidence to 0.1, and the minimum rule length to 2. These thresholds were chosen because at higher support or confidence levels the algorithm found either no rules or too few to interpret meaningfully. In the end, we obtained 61 rules, all of which contain exactly two items (one on each side). The algorithm results are as follows:
transactionsrules <- apriori(transactions, parameter = list(support = 0.002, confidence = 0.1, minlen = 2))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5   0.002      2
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 29
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 14963 transaction(s)] done [0.00s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [61 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
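The "Absolute minimum support count" line in the log translates the relative support threshold into a number of baskets. A small sketch of that arithmetic, taking the 0.002 threshold literally (variable names are illustrative):

```r
n_transactions <- 14963
min_support    <- 0.002

# Relative threshold expressed in baskets: 29.926 (shown as 29 in the log).
threshold_in_baskets <- min_support * n_transactions

# Counts are whole numbers, so reaching a support of 0.002 takes at least 30 baskets.
min_count <- ceiling(threshold_in_baskets)

print(c(threshold = threshold_in_baskets, min_count = min_count))
print(c(support_29 = 29 / n_transactions, support_30 = 30 / n_transactions))
```

In other words, a rule is only reported if its items appear together in about 30 of the 14963 baskets.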
inspect(transactionsrules)
##      lhs                           rhs                support     confidence coverage   lift      count
## [1]  {candy}                    => {whole milk}       0.002138609 0.1488372  0.01436878 0.9424677    32
## [2]  {meat}                     => {other vegetables} 0.002138609 0.1269841  0.01684154 1.0399910    32
## [3]  {meat}                     => {whole milk}       0.002205440 0.1309524  0.01684154 0.8292173    33
## [4]  {ham}                      => {whole milk}       0.002740092 0.1601562  0.01710887 1.0141422    41
## [5]  {frozen meals}             => {other vegetables} 0.002138609 0.1274900  0.01677471 1.0441344    32
## [6]  {sugar}                    => {whole milk}       0.002472766 0.1396226  0.01771035 0.8841192    37
## [7]  {long life bakery product} => {whole milk}       0.002405935 0.1343284  0.01791085 0.8505947    36
## [8]  {waffles}                  => {whole milk}       0.002606429 0.1407942  0.01851233 0.8915379    39
## [9]  {salty snack}              => {other vegetables} 0.002205440 0.1174377  0.01877966 0.9618066    33
## [10] {onions}                   => {whole milk}       0.002940587 0.1452145  0.02024995 0.9195281    44
## [11] {UHT-milk}                 => {other vegetables} 0.002138609 0.1000000  0.02138609 0.8189929    32
## [12] {UHT-milk}                 => {whole milk}       0.002539598 0.1187500  0.02138609 0.7519493    38
## [13] {berries}                  => {other vegetables} 0.002673261 0.1226994  0.02178707 1.0048992    40
## [14] {berries}                  => {whole milk}       0.002272272 0.1042945  0.02178707 0.6604140    34
## [15] {hamburger meat}           => {other vegetables} 0.002205440 0.1009174  0.02185391 0.8265066    33
## [16] {hamburger meat}           => {whole milk}       0.003074250 0.1406728  0.02185391 0.8907689    46
## [17] {dessert}                  => {whole milk}       0.002405935 0.1019830  0.02359153 0.6457773    36
## [18] {napkins}                  => {whole milk}       0.002405935 0.1087613  0.02212123 0.6886990    36
## [19] {cream cheese}             => {whole milk}       0.002873755 0.1214689  0.02365836 0.7691661    43
## [20] {chocolate}                => {rolls/buns}       0.002806924 0.1189802  0.02359153 1.0815919    42
## [21] {chocolate}                => {whole milk}       0.002940587 0.1246459  0.02359153 0.7892833    44
## [22] {white bread}              => {other vegetables} 0.002606429 0.1086351  0.02399251 0.8897137    39
## [23] {white bread}              => {whole milk}       0.003141081 0.1309192  0.02399251 0.8290073    47
## [24] {chicken}                  => {rolls/buns}       0.002873755 0.1031175  0.02786874 0.9373920    43
## [25] {chicken}                  => {whole milk}       0.003408407 0.1223022  0.02786874 0.7744423    51
## [26] {frozen vegetables}        => {other vegetables} 0.003141081 0.1121718  0.02800241 0.9186794    47
## [27] {frozen vegetables}        => {whole milk}       0.003809397 0.1360382  0.02800241 0.8614217    57
## [28] {coffee}                   => {whole milk}       0.003809397 0.1205074  0.03161131 0.7630775    57
## [29] {margarine}                => {whole milk}       0.004076723 0.1265560  0.03221279 0.8013786    61
## [30] {beef}                     => {whole milk}       0.004678206 0.1377953  0.03395041 0.8725479    70
## [31] {fruit/vegetable juice}    => {rolls/buns}       0.003742565 0.1100196  0.03401724 1.0001361    56
## [32] {fruit/vegetable juice}    => {whole milk}       0.004410880 0.1296660  0.03401724 0.8210717    66
## [33] {curd}                     => {other vegetables} 0.003542070 0.1051587  0.03368308 0.8612425    53
## [34] {curd}                     => {whole milk}       0.004143554 0.1230159  0.03368308 0.7789617    62
## [35] {butter}                   => {whole milk}       0.004678206 0.1328273  0.03522021 0.8410898    70
## [36] {pork}                     => {other vegetables} 0.003943060 0.1063063  0.03709149 0.8706411    59
## [37] {pork}                     => {whole milk}       0.005012364 0.1351351  0.03709149 0.8557034    75
## [38] {domestic eggs}            => {whole milk}       0.005279690 0.1423423  0.03709149 0.9013409    79
## [39] {brown bread}              => {whole milk}       0.004477712 0.1190053  0.03762614 0.7535661    67
## [40] {newspapers}               => {whole milk}       0.005613847 0.1443299  0.03889594 0.9139265    84
## [41] {frankfurter}              => {other vegetables} 0.005146027 0.1362832  0.03775981 1.1161496    77
## [42] {frankfurter}              => {whole milk}       0.005279690 0.1398230  0.03775981 0.8853879    79
## [43] {whipped/sour cream}       => {whole milk}       0.004611375 0.1055046  0.04370781 0.6680767    69
## [44] {bottled beer}             => {other vegetables} 0.004678206 0.1032448  0.04531177 0.8455679    70
## [45] {bottled beer}             => {whole milk}       0.007150972 0.1578171  0.04531177 0.9993303   107
## [46] {canned beer}              => {whole milk}       0.006014837 0.1282051  0.04691573 0.8118211    90
## [47] {shopping bags}            => {other vegetables} 0.004945532 0.1039326  0.04758404 0.8512005    74
## [48] {shopping bags}            => {whole milk}       0.006348994 0.1334270  0.04758404 0.8448869    95
## [49] {pip fruit}                => {rolls/buns}       0.004945532 0.1008174  0.04905433 0.9164832    74
## [50] {pip fruit}                => {other vegetables} 0.004945532 0.1008174  0.04905433 0.8256876    74
## [51] {pip fruit}                => {whole milk}       0.006616320 0.1348774  0.04905433 0.8540712    99
## [52] {pastry}                   => {whole milk}       0.006482657 0.1253230  0.05172759 0.7935709    97
## [53] {citrus fruit}             => {whole milk}       0.007150972 0.1345912  0.05313106 0.8522590   107
## [54] {bottled water}            => {whole milk}       0.007150972 0.1178414  0.06068302 0.7461959   107
## [55] {sausage}                  => {whole milk}       0.008955423 0.1483942  0.06034886 0.9396627   134
## [56] {root vegetables}          => {whole milk}       0.007551962 0.1085495  0.06957161 0.6873575   113
## [57] {tropical fruit}           => {whole milk}       0.008220277 0.1213018  0.06776716 0.7681077   123
## [58] {yogurt}                   => {whole milk}       0.011160863 0.1299611  0.08587850 0.8229402   167
## [59] {soda}                     => {whole milk}       0.011628684 0.1197522  0.09710620 0.7582957   174
## [60] {rolls/buns}               => {whole milk}       0.013967787 0.1269745  0.11000468 0.8040284   209
## [61] {other vegetables}         => {whole milk}       0.014836597 0.1215107  0.12210118 0.7694305   222
summary(transactionsrules)
## set of 61 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 61
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       2       2       2       2       2       2
##
## summary of quality measures:
##     support           confidence        coverage            lift            count
##  Min.   :0.002139   Min.   :0.1000   Min.   :0.01437   Min.   :0.6458   Min.   : 32.00
##  1st Qu.:0.002673   1st Qu.:0.1100   1st Qu.:0.02185   1st Qu.:0.7893   1st Qu.: 40.00
##  Median :0.003943   Median :0.1246   Median :0.03368   Median :0.8506   Median : 59.00
##  Mean   :0.004697   Mean   :0.1243   Mean   :0.03795   Mean   :0.8543   Mean   : 70.28
##  3rd Qu.:0.005280   3rd Qu.:0.1349   3rd Qu.:0.04692   3rd Qu.:0.9139   3rd Qu.: 79.00
##  Max.   :0.014837   Max.   :0.1602   Max.   :0.12210   Max.   :1.1161   Max.   :222.00
##
## mining info:
##          data ntransactions support confidence
##  transactions         14963   0.002        0.1
##                                                                                           call
##  apriori(data = transactions, parameter = list(support = 0.002, confidence = 0.1, minlen = 2))
To interpret the graph below, it is important to understand three concepts:
Support: a measure of how often a given set of items appears across all transactions, i.e. the share of baskets that contain every item in the rule.
Confidence: the estimated probability that a consumer who has product X (lhs) in the basket also purchases product Y (rhs).
Lift: the ratio of the rule's confidence to the overall share of baskets containing Y. A lift of 1 is the neutral point, meaning X and Y appear together exactly as often as expected if they were bought independently. Values above 1 mean the items are bought together more often than expected; values below 1 mean they are bought together less often than expected.
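These measures can be reproduced by hand from raw counts. The sketch below is an illustration, not the arules implementation; the function and argument names are hypothetical, and the counts are taken from the outputs above (209 baskets for rule [60], 1646 baskets with rolls/buns, 2363 with whole milk, 14963 baskets in total).

```r
# Rule metrics from raw counts (function and argument names are illustrative).
rule_metrics <- function(n_both, n_lhs, n_rhs, n_total) {
  support    <- n_both / n_total                # share of baskets containing lhs and rhs
  confidence <- n_both / n_lhs                  # share of lhs baskets that also contain rhs
  coverage   <- n_lhs / n_total                 # share of baskets containing lhs
  lift       <- confidence / (n_rhs / n_total)  # confidence relative to the rhs base rate
  c(support = support, confidence = confidence, coverage = coverage, lift = lift)
}

# Rule [60] {rolls/buns} => {whole milk}
m <- rule_metrics(n_both = 209, n_lhs = 1646, n_rhs = 2363, n_total = 14963)
print(round(m, 4))
```

The results agree with the values reported for rule [60] in the inspect() output: support 0.0140, confidence 0.1270, coverage 0.1100 and lift 0.8040.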
plot(transactionsrules, method="graph", measure="support", shading="lift", engine="html")
Rule 60: {rolls/buns} => {whole milk}
Support = 0.014 - 1.4% of all transactions contain both rolls/buns and whole milk.
Confidence = 0.127 - If a consumer buys rolls/buns, there’s a 12.7% chance they also bought whole milk.
Lift = 0.804 - Rolls/buns and whole milk appear together about 20% less often than would be expected if the two were bought independently. The value is below the neutral point of 1, so the rule indicates a mild tendency to buy these products separately rather than a positive association.
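Since lift is read relative to 1, it can help to separate the rules that exceed it. In the inspect() output only seven of the 61 rules do. The sketch below filters a hand-copied sample of rows in base R; the data frame and its column names are illustrative, and the commented arules one-liner is a suggestion that is not part of the original analysis.

```r
# A few rows copied from the inspect() output (rule index, lhs, rhs, lift).
rules_sample <- data.frame(
  rule = c(2, 4, 5, 13, 20, 31, 41, 60, 61),
  lhs  = c("meat", "ham", "frozen meals", "berries", "chocolate",
           "fruit/vegetable juice", "frankfurter", "rolls/buns", "other vegetables"),
  rhs  = c("other vegetables", "whole milk", "other vegetables", "other vegetables",
           "rolls/buns", "rolls/buns", "other vegetables", "whole milk", "whole milk"),
  lift = c(1.0399910, 1.0141422, 1.0441344, 1.0048992, 1.0815919,
           1.0001361, 1.1161496, 0.8040284, 0.7694305),
  stringsAsFactors = FALSE
)

# Keep rules whose items co-occur more often than independence would predict,
# ordered from the highest lift downwards.
positive <- rules_sample[rules_sample$lift > 1, ]
positive <- positive[order(-positive$lift), ]
print(positive, row.names = FALSE)

# With the full rules object, the arules equivalent would be along the lines of:
# inspect(sort(subset(transactionsrules, lift > 1), by = "lift"))
```

Even the strongest of these, {frankfurter} => {other vegetables} with a lift of about 1.12, is only a weak positive association.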
In this paper, the Apriori algorithm was applied to grocery market basket data. The analysis yielded 61 rules, in which the right-hand side (rhs) is always one of three products: whole milk, other vegetables, or rolls/buns. By adjusting the algorithm parameters in the code, such as support and confidence, it may be possible to discover more specific and interesting rules tailored to the user's needs.
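One way to explore the effect of the parameters is to rerun apriori() over a grid of thresholds and count the resulting rules. The sketch below does this on a handful of made-up baskets so that it runs on its own; the toy data and the chosen support values are assumptions, and with the study's data the transactions object would be passed in instead.

```r
library(arules)

# Toy baskets for illustration only; they are not taken from the dataset.
toy_baskets <- list(
  c("whole milk", "rolls/buns"),
  c("whole milk", "rolls/buns", "soda"),
  c("whole milk", "yogurt"),
  c("rolls/buns", "soda"),
  c("whole milk", "soda")
)
toy_transactions <- as(toy_baskets, "transactions")

# Count the rules found at each support threshold, holding confidence fixed.
supports <- c(0.2, 0.4, 0.6)
rule_counts <- sapply(supports, function(s) {
  rules <- apriori(toy_transactions,
                   parameter = list(support = s, confidence = 0.1, minlen = 2),
                   control = list(verbose = FALSE))
  length(rules)
})

print(data.frame(support = supports, rules = rule_counts))
```

Raising the support threshold can only keep or reduce the number of rules, so a table like this shows how quickly the rule set shrinks and where a workable threshold might lie.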