Association rules are a powerful data mining technique that helps to uncover relationships between variables in datasets. In the context of a grocery dataset, it is possible to identify frequently purchased items and any associations between them. This information can then be used to inform sales and marketing strategies, such as creating product bundles, cross-selling, and up-selling. We will start by loading the data, then preprocessing it, generating rules, visualizing the results, and finally, analyzing and interpreting the findings. The ultimate goal of this analysis could be to provide recommendations that can be used to develop marketing strategies. Dataset link: https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset
First, let’s load the packages and the data, then take a look at what we’re working with.
# load the packages
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(arulesCBA)
trans <- read.transactions('data/Groceries_dataset.csv', format = "single", sep = ",", cols = c("Member_number", "itemDescription"), header = TRUE)
inspect(trans[1:10])
## items transactionID
## [1] {canned beer,
## hygiene articles,
## misc. beverages,
## pastry,
## pickled vegetables,
## salty snack,
## sausage,
## semi-finished bread,
## soda,
## whole milk,
## yogurt} 1000
## [2] {beef,
## curd,
## frankfurter,
## rolls/buns,
## sausage,
## soda,
## whipped/sour cream,
## white bread,
## whole milk} 1001
## [3] {butter,
## butter milk,
## frozen vegetables,
## other vegetables,
## specialty chocolate,
## sugar,
## tropical fruit,
## whole milk} 1002
## [4] {dental care,
## detergent,
## frozen meals,
## rolls/buns,
## root vegetables,
## sausage} 1003
## [5] {canned beer,
## chocolate,
## cling film/bags,
## dish cleaner,
## frozen fish,
## hygiene articles,
## other vegetables,
## packaged fruit/vegetables,
## pastry,
## pip fruit,
## red/blush wine,
## rolls/buns,
## root vegetables,
## shopping bags,
## tropical fruit,
## whole milk} 1004
## [6] {margarine,
## rolls/buns,
## whipped/sour cream} 1005
## [7] {bottled beer,
## bottled water,
## chicken,
## chocolate,
## flour,
## frankfurter,
## rice,
## rolls/buns,
## shopping bags,
## skin care,
## softener,
## whole milk} 1006
## [8] {dessert,
## domestic eggs,
## hamburger meat,
## liquor (appetizer),
## liver loaf,
## photo/film,
## root vegetables,
## soda,
## tropical fruit,
## white wine,
## yogurt} 1008
## [9] {canned fish,
## cocoa drinks,
## herbs,
## ketchup,
## newspapers,
## pastry,
## tropical fruit,
## yogurt} 1009
## [10] {bottled water,
## candles,
## coffee,
## frankfurter,
## kitchen towels,
## pip fruit,
## rolls/buns,
## sliced cheese,
## specialty bar,
## UHT-milk} 1010
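The `format = "single"` argument used above tells `read.transactions` that the file has one row per purchased item and that rows sharing the same ID column form one basket. The grouping idea can be sketched in base R on a made-up table. The rows below are illustrative only, and the `Date` column is an assumption about the file layout that is not used anywhere in this analysis.

```r
# Hypothetical long-format purchases: one row per item bought.
# The values (and the Date column) are made up for illustration.
purchases <- data.frame(
  Member_number   = c(1000, 1000, 1000, 1001, 1001),
  Date            = c("01-01-2015", "01-01-2015", "15-03-2015",
                      "02-02-2015", "02-02-2015"),
  itemDescription = c("whole milk", "soda", "yogurt", "beef", "soda")
)

# Grouping by Member_number only (as in the read.transactions call)
# merges everything recorded for a member into one basket.
by_member <- split(purchases$itemDescription, purchases$Member_number)

# Grouping by member and date would give one basket per visit instead.
by_visit <- split(purchases$itemDescription,
                  paste(purchases$Member_number, purchases$Date))

lengths(by_member)
lengths(by_visit)
```

With member-level grouping the baskets are fewer and larger, which is consistent with the long item lists printed above.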
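The transactions are quite large and diverse, which could mean the analysis will yield interesting insights. Note that the data were grouped by Member_number, so a "transaction" here collects everything recorded for one member rather than a single store visit. Let’s check the relative frequency of the most common items.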
itemFrequencyPlot(trans, topN=15, type="relative", main="Grocery item frequency")
We see that whole milk is in over 40% of the transactions! Other vegetables, rolls/buns, soda and yogurt are also very frequent. This is expected, as these are everyday food and drink items that are bought very often. We should also look at the summary of this data.
summary(trans)
## transactions as itemMatrix in sparse format with
## 3898 rows (elements/itemsets/transactions) and
## 167 columns (items) and a density of 0.05340678
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 1786 1468 1363 1222
## yogurt (Other)
## 1103 27824
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 6 248 87 331 261 381 303 332 340 296 276 238 181 179 123 97 66 46 39 28
## 21 22 23 24 25 26
## 15 13 3 5 2 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 8.500 8.919 12.000 26.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
##
## includes extended transaction information - examples:
## transactionID
## 1 1000
## 2 1001
## 3 1002
Here we can see the absolute counts of the most frequent items and some basic descriptive statistics about the dataset: 3898 transactions, 167 distinct items, and a median basket size of 8.5 items. Let’s move on to preprocessing.
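As a quick sanity check, the relative frequencies, the mean basket size and the density reported by `summary(trans)` can be recomputed from the counts it prints. This is only a sketch: the numbers are copied by hand from the output above, and the variable names are made up for illustration.

```r
# Counts copied from the summary(trans) output above.
n_transactions <- 3898
n_items        <- 167
item_counts <- c("whole milk" = 1786, "other vegetables" = 1468,
                 "rolls/buns" = 1363, "soda" = 1222, "yogurt" = 1103)
other_count <- 27824   # the "(Other)" entry

# Share of transactions that contain each of the top items.
relative_freq <- item_counts / n_transactions

# Total number of (transaction, item) pairs in the sparse matrix.
total_purchases <- sum(item_counts) + other_count

# Mean basket size and density of the transaction-by-item matrix.
mean_basket_size <- total_purchases / n_transactions
density <- total_purchases / (n_transactions * n_items)

round(relative_freq, 3)
round(mean_basket_size, 3)
round(density, 8)
```

The mean and density obtained this way agree with the values 8.919 and 0.05340678 in the summary.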
This data doesn’t really need any preprocessing, as the read.transactions function did most of the hard work for us, so let’s move on to the analysis.
rules<-apriori(trans)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 389
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [29 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
With the default settings, the Apriori algorithm did not generate any rules. This does not mean that there are no associations in the data, only that none of them meets the default thresholds: a minimum support of 0.1 (an absolute count of 389 in the output above) and a minimum confidence of 0.8. Support is the minimum share of transactions in which an itemset must appear to be considered at all, and confidence is the minimum share of transactions containing the left-hand side that must also contain the right-hand side. Only 29 of the 167 items pass the support filter, and no rule built from the itemsets that pass reaches 80% confidence. Let’s try lower support and confidence thresholds.
rules <- apriori(trans, parameter=list(supp=0.03, conf=0.60, minlen=2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.03 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 116
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [72 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
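The "Absolute minimum support count" lines in the two runs above can be related to the relative thresholds with simple arithmetic. The sketch below only multiplies each support threshold by the number of transactions; truncating the result reproduces the printed values, though this says nothing about how `apriori` computes them internally.

```r
n_transactions <- 3898

# Relative support thresholds used in the two apriori() calls.
support_thresholds <- c(default = 0.1, lowered = 0.03)

# Number of transactions an itemset must appear in to reach each threshold.
min_counts <- support_thresholds * n_transactions
min_counts          # 389.8 and 116.94 before truncation

# Truncated values, matching the counts printed by apriori().
floor(min_counts)
```

Lowering the support from 0.1 to 0.03 therefore lets in itemsets that occur in roughly 117 baskets instead of roughly 390.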
These results show that the algorithm generated 8 rules from the data. By lowering the minimum support to 0.03 and the minimum confidence to 0.6, we let more itemsets through (72 items pass the support filter instead of 29), and some patterns emerge. Now that we have some rules to work with, we can plot them.
set.seed(42)
plot(rules, method="graph", measure="support", shading="lift")
Here we can see the rules as a graph, with color showing lift and size showing support. Let’s look at some other plots as well.
plot(rules, method="paracoord", control=list(reorder=TRUE))
plot(rules, shading="order", control=list(main="Two-key plot"))
These plots show that every rule has whole milk on the right-hand side, which is not surprising as it is the most common item in the transactions. We can also see that two of the rules have three items on the left-hand side. Let’s inspect the rules in table form.
The support metric is the share of transactions that contain the lhs and rhs items together. A higher support value means that the items occur together more often.
inspect(sort(rules, by = "support"), linebreak = FALSE)
## lhs rhs support
## [1] {rolls/buns, shopping bags} => {whole milk} 0.04130323
## [2] {bottled water, yogurt} => {whole milk} 0.04027707
## [3] {bottled beer, rolls/buns} => {whole milk} 0.03822473
## [4] {pastry, yogurt} => {whole milk} 0.03488969
## [5] {other vegetables, rolls/buns, yogurt} => {whole milk} 0.03437660
## [6] {shopping bags, yogurt} => {whole milk} 0.03309389
## [7] {other vegetables, rolls/buns, soda} => {whole milk} 0.03181119
## [8] {beef, other vegetables} => {whole milk} 0.03052848
## confidence coverage lift count
## [1] 0.6007463 0.06875321 1.311147 161
## [2] 0.6061776 0.06644433 1.323001 157
## [3] 0.6056911 0.06310929 1.321939 149
## [4] 0.6017699 0.05797845 1.313381 136
## [5] 0.6568627 0.05233453 1.433623 134
## [6] 0.6028037 0.05489995 1.315638 129
## [7] 0.6048780 0.05259107 1.320165 124
## [8] 0.6010101 0.05079528 1.311723 119
The confidence metric represents the proportion of transactions containing the lhs items that also contain the rhs item. A higher confidence value indicates a stronger relationship between the lhs and rhs items.
inspect(sort(rules, by = "confidence"), linebreak = FALSE)
## lhs rhs support
## [1] {other vegetables, rolls/buns, yogurt} => {whole milk} 0.03437660
## [2] {bottled water, yogurt} => {whole milk} 0.04027707
## [3] {bottled beer, rolls/buns} => {whole milk} 0.03822473
## [4] {other vegetables, rolls/buns, soda} => {whole milk} 0.03181119
## [5] {shopping bags, yogurt} => {whole milk} 0.03309389
## [6] {pastry, yogurt} => {whole milk} 0.03488969
## [7] {beef, other vegetables} => {whole milk} 0.03052848
## [8] {rolls/buns, shopping bags} => {whole milk} 0.04130323
## confidence coverage lift count
## [1] 0.6568627 0.05233453 1.433623 134
## [2] 0.6061776 0.06644433 1.323001 157
## [3] 0.6056911 0.06310929 1.321939 149
## [4] 0.6048780 0.05259107 1.320165 124
## [5] 0.6028037 0.05489995 1.315638 129
## [6] 0.6017699 0.05797845 1.313381 136
## [7] 0.6010101 0.05079528 1.311723 119
## [8] 0.6007463 0.06875321 1.311147 161
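To make the support and confidence columns concrete, both can be recomputed from the `count` and `coverage` columns for the top rule, {other vegetables, rolls/buns, yogurt} => {whole milk}. Coverage is the support of the lhs alone, so multiplying it by the number of transactions gives the number of baskets that contain the lhs. The numbers are copied from the table above and the variable names are illustrative.

```r
n_transactions <- 3898

# Values copied from rule [1] in the table sorted by confidence.
count_rule <- 134          # baskets with lhs and rhs together
coverage   <- 0.05233453   # support of the lhs alone

# Number of baskets that contain the lhs items.
count_lhs <- round(coverage * n_transactions)

# Support: share of all baskets that contain lhs and rhs together.
support <- count_rule / n_transactions

# Confidence: share of lhs baskets that also contain the rhs.
confidence <- count_rule / count_lhs

count_lhs
round(support, 7)
round(confidence, 7)
```

In words: 204 baskets contain other vegetables, rolls/buns and yogurt, and 134 of them also contain whole milk.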
The lift metric represents the strength of association between the lhs and rhs items, compared to their expected occurrence if they were independent of each other. A lift value greater than 1 indicates a positive association between the items, while a lift value less than 1 indicates a negative association.
inspect(sort(rules, by = "lift"), linebreak = FALSE)
## lhs rhs support
## [1] {other vegetables, rolls/buns, yogurt} => {whole milk} 0.03437660
## [2] {bottled water, yogurt} => {whole milk} 0.04027707
## [3] {bottled beer, rolls/buns} => {whole milk} 0.03822473
## [4] {other vegetables, rolls/buns, soda} => {whole milk} 0.03181119
## [5] {shopping bags, yogurt} => {whole milk} 0.03309389
## [6] {pastry, yogurt} => {whole milk} 0.03488969
## [7] {beef, other vegetables} => {whole milk} 0.03052848
## [8] {rolls/buns, shopping bags} => {whole milk} 0.04130323
## confidence coverage lift count
## [1] 0.6568627 0.05233453 1.433623 134
## [2] 0.6061776 0.06644433 1.323001 157
## [3] 0.6056911 0.06310929 1.321939 149
## [4] 0.6048780 0.05259107 1.320165 124
## [5] 0.6028037 0.05489995 1.315638 129
## [6] 0.6017699 0.05797845 1.313381 136
## [7] 0.6010101 0.05079528 1.311723 119
## [8] 0.6007463 0.06875321 1.311147 161
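The table sorted by lift has the same order as the one sorted by confidence. A short sketch shows why: lift is confidence divided by the support of the rhs, and here the rhs is always whole milk. The confidence values are copied from the table, and the whole milk count (1786) comes from `summary(trans)`.

```r
n_transactions <- 3898
supp_milk <- 1786 / n_transactions   # support of the rhs, whole milk

# Confidence of the 8 rules, in the order of the table above.
confidence <- c(0.6568627, 0.6061776, 0.6056911, 0.6048780,
                0.6028037, 0.6017699, 0.6010101, 0.6007463)

# Lift = confidence / support(rhs).
lift <- confidence / supp_milk
round(lift, 6)

# The rhs is the same for every rule, so lift is confidence divided by
# a constant and both orderings coincide.
same_order <- identical(order(confidence, decreasing = TRUE),
                        order(lift, decreasing = TRUE))
same_order
```

Sorting by lift would only tell us something new if the rules had different right-hand sides.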
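These tables look as expected. Sorting by confidence and by lift gives the same order, because every rule has the same right-hand side (whole milk), so the different metrics do not give us much more insight. All lifts are between 1.31 and 1.43, so the associations are positive but modest. We should perform our own analysis with some assumptions and questions to be answered.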
rules.milk <- apriori(data=trans, parameter=list(supp=0.001, conf=0.05, minlen=2), appearance=list(default="rhs", lhs="whole milk"), control=list(verbose=FALSE))
rules.milk.byconf<-sort(rules.milk, by="confidence", decreasing=TRUE)
inspect(head(rules.milk.byconf))
## lhs rhs support confidence coverage lift
## [1] {whole milk} => {other vegetables} 0.1913802 0.4176932 0.4581837 1.109106
## [2] {whole milk} => {rolls/buns} 0.1785531 0.3896976 0.4581837 1.114484
## [3] {whole milk} => {soda} 0.1511031 0.3297872 0.4581837 1.051973
## [4] {whole milk} => {yogurt} 0.1505900 0.3286674 0.4581837 1.161510
## [5] {whole milk} => {tropical fruit} 0.1164700 0.2541993 0.4581837 1.087672
## [6] {whole milk} => {root vegetables} 0.1131349 0.2469205 0.4581837 1.070630
## count
## [1] 746
## [2] 696
## [3] 589
## [4] 587
## [5] 454
## [6] 441
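Unsurprisingly, the items most often bought together with whole milk are the same popular items that appeared on the left-hand side of the earlier rules. The lifts are only between 1.05 and 1.16, so these rules mostly reflect how popular the items are overall.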
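Swapping the two sides of a rule changes its confidence but not its lift, which is one way to see why the same items show up in both directions. The sketch below uses whole milk and other vegetables, with counts copied from `summary(trans)` (1786 and 1468) and from the `count` column above (746). The variable names are illustrative.

```r
n      <- 3898   # transactions
n_milk <- 1786   # baskets with whole milk
n_veg  <- 1468   # baskets with other vegetables
n_both <- 746    # baskets with both

# Confidence depends on the direction of the rule.
conf_milk_to_veg <- n_both / n_milk
conf_veg_to_milk <- n_both / n_veg

# Lift treats both items symmetrically.
lift <- (n_both / n) / ((n_milk / n) * (n_veg / n))

round(c(milk_to_veg = conf_milk_to_veg,
        veg_to_milk = conf_veg_to_milk,
        lift = lift), 4)
```

The first confidence and the lift agree with rule [1] in the table above; the reverse confidence is higher simply because other vegetables is the less frequent of the two items.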
rules.pet <- apriori(data=trans, parameter=list(supp=0.001, conf=0.005, minlen=2), appearance=list(default="lhs", rhs="pet care"), control=list(verbose=FALSE))
rules.pet.byconf<-sort(rules.pet, by="confidence", decreasing=TRUE)
inspect(head(rules.pet.byconf), linebreak = FALSE)
## lhs rhs
## [1] {baking powder, citrus fruit, other vegetables, pastry} => {pet care}
## [2] {baking powder, citrus fruit, pastry} => {pet care}
## [3] {long life bakery product, pastry, rolls/buns, soda} => {pet care}
## [4] {baking powder, other vegetables, pastry, whole milk} => {pet care}
## [5] {citrus fruit, frankfurter, pastry, soda} => {pet care}
## [6] {brown bread, other vegetables, rolls/buns, soda, yogurt} => {pet care}
## support confidence coverage lift count
## [1] 0.001026167 0.6666667 0.001539251 30.57255 4
## [2] 0.001026167 0.5714286 0.001795793 26.20504 4
## [3] 0.001026167 0.3333333 0.003078502 15.28627 4
## [4] 0.001026167 0.3076923 0.003335044 14.11041 4
## [5] 0.001026167 0.3076923 0.003335044 14.11041 4
## [6] 0.001282709 0.2941176 0.004361211 13.48789 5
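Interestingly, there appears to be some association between baking powder, citrus fruit and pastry on one side and pet care items on the other. It is hard to explain this result, and it should be treated with caution: each of these rules rests on only 4 or 5 transactions, so it may well be noise.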
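How little data stands behind the top pet care rule can be checked directly. The sketch below recovers the number of lhs baskets from the coverage column and attaches an exact binomial confidence interval to the rule's confidence using base R's `binom.test`. Using such an interval here is an illustration added for this check, not part of the original analysis.

```r
n_transactions <- 3898

# Values copied from rule [1] in the pet care table above.
count_rule <- 4             # baskets with lhs and pet care together
coverage   <- 0.001539251   # support of the lhs alone

# Number of baskets containing the lhs items.
count_lhs <- round(coverage * n_transactions)

# Point estimate and exact 95% interval for the confidence.
confidence <- count_rule / count_lhs
ci <- binom.test(count_rule, count_lhs)$conf.int

# Smallest count that satisfies supp = 0.001.
min_count <- ceiling(0.001 * n_transactions)

count_lhs
confidence
ci
min_count
```

The confidence of 0.67 is 4 out of 6 baskets, and the interval around it is very wide. Since the rules shown are the top few picked from a large number of candidate itemsets, even the lower end of that interval should be read cautiously.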
rules.coffee <- apriori(data=trans, parameter=list(supp=0.001, conf=0.005, minlen=2), appearance=list(default="lhs", rhs="instant coffee"), control=list(verbose=FALSE))
rules.coffee.byconf<-sort(rules.coffee, by="confidence", decreasing=TRUE)
inspect(head(rules.coffee.byconf), linebreak = FALSE)
## lhs
## [1] {beef, butter milk, other vegetables, root vegetables}
## [2] {mayonnaise, rolls/buns, yogurt}
## [3] {newspapers, other vegetables, rolls/buns, tropical fruit, yogurt}
## [4] {long life bakery product, root vegetables, shopping bags}
## [5] {bottled water, frankfurter, other vegetables, rolls/buns, whole milk, yogurt}
## [6] {chewing gum, margarine, other vegetables}
## rhs support confidence coverage lift count
## [1] => {instant coffee} 0.001026167 0.5000000 0.002052335 33.03390 4
## [2] => {instant coffee} 0.001026167 0.4000000 0.002565418 26.42712 4
## [3] => {instant coffee} 0.001026167 0.3636364 0.002821960 24.02465 4
## [4] => {instant coffee} 0.001026167 0.3076923 0.003335044 20.32855 4
## [5] => {instant coffee} 0.001026167 0.3076923 0.003335044 20.32855 4
## [6] => {instant coffee} 0.001026167 0.2857143 0.003591585 18.87651 4
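Apparently, some baskets that contain newspapers or chewing gum contain instant coffee as well. Maybe these customers read the newspaper while drinking coffee and chew gum afterwards to get rid of coffee breath. Each of these rules is backed by only 4 transactions, though, so this is speculation rather than a finding.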
rules.liquor <- apriori(data=trans, parameter=list(supp=0.001, conf=0.005, minlen=2), appearance=list(default="rhs", lhs="liquor"), control=list(verbose=FALSE))
rules.liquor.byconf<-sort(rules.liquor, by="confidence", decreasing=TRUE)
inspect(head(rules.liquor.byconf, 15), linebreak = FALSE)
## lhs rhs support confidence coverage lift
## [1] {liquor} => {whole milk} 0.016675218 0.6310680 0.02642381 1.377325
## [2] {liquor} => {other vegetables} 0.012827091 0.4854369 0.02642381 1.288987
## [3] {liquor} => {rolls/buns} 0.011287840 0.4271845 0.02642381 1.221691
## [4] {liquor} => {soda} 0.010774756 0.4077670 0.02642381 1.300717
## [5] {liquor} => {yogurt} 0.010261673 0.3883495 0.02642381 1.372426
## [6] {liquor} => {tropical fruit} 0.009235505 0.3495146 0.02642381 1.495508
## [7] {liquor} => {root vegetables} 0.008209338 0.3106796 0.02642381 1.347085
## [8] {liquor} => {sausage} 0.007439713 0.2815534 0.02642381 1.366744
## [9] {liquor} => {bottled water} 0.007439713 0.2815534 0.02642381 1.317521
## [10] {liquor} => {shopping bags} 0.006926629 0.2621359 0.02642381 1.557631
## [11] {liquor} => {newspapers} 0.006157004 0.2330097 0.02642381 1.666554
## [12] {liquor} => {bottled beer} 0.005387378 0.2038835 0.02642381 1.283906
## [13] {liquor} => {citrus fruit} 0.005387378 0.2038835 0.02642381 1.099222
## [14] {liquor} => {chicken} 0.005130836 0.1941748 0.02642381 1.930850
## [15] {liquor} => {brown bread} 0.005130836 0.1941748 0.02642381 1.428100
## count
## [1] 65
## [2] 50
## [3] 44
## [4] 42
## [5] 40
## [6] 36
## [7] 32
## [8] 29
## [9] 29
## [10] 27
## [11] 24
## [12] 21
## [13] 21
## [14] 20
## [15] 20
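The table above is sorted by confidence, which tends to favor items that are popular with everyone. Re-ranking a few of its rows by lift gives a different picture. The sketch below copies seven of the fifteen rows by hand into a data frame (the selection is arbitrary and the data frame name is made up) and orders them by lift.

```r
# A hand-copied subset of the liquor rules shown above.
liquor_rules <- data.frame(
  rhs        = c("whole milk", "other vegetables", "tropical fruit",
                 "shopping bags", "newspapers", "citrus fruit", "chicken"),
  confidence = c(0.6310680, 0.4854369, 0.3495146,
                 0.2621359, 0.2330097, 0.2038835, 0.1941748),
  lift       = c(1.377325, 1.288987, 1.495508,
                 1.557631, 1.666554, 1.099222, 1.930850),
  count      = c(65, 50, 36, 27, 24, 21, 20)
)

# Order by lift instead of confidence.
by_lift <- liquor_rules[order(liquor_rules$lift, decreasing = TRUE), ]
by_lift
```

Within this subset, chicken and newspapers move to the top while citrus fruit drops to the bottom, although the counts behind the high-lift rows are small (20 to 24 baskets).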
These are mostly ordinary day-to-day products. Tropical and citrus fruits are interesting, though; maybe they are used for cocktails? Let’s ask one last question.
rules.frozen <- apriori(data=trans, parameter=list(supp=0.001, conf=0.005, minlen=2), appearance=list(default="rhs", lhs="frozen meals"), control=list(verbose=FALSE))
rules.frozen.byconf<-sort(rules.frozen, by="confidence", decreasing=TRUE)
inspect(head(rules.frozen.byconf, 20), linebreak = FALSE)
## lhs rhs support confidence coverage
## [1] {frozen meals} => {whole milk} 0.03258081 0.5183673 0.06285274
## [2] {frozen meals} => {rolls/buns} 0.02821960 0.4489796 0.06285274
## [3] {frozen meals} => {other vegetables} 0.02770652 0.4408163 0.06285274
## [4] {frozen meals} => {yogurt} 0.02077989 0.3306122 0.06285274
## [5] {frozen meals} => {soda} 0.02077989 0.3306122 0.06285274
## [6] {frozen meals} => {tropical fruit} 0.01770139 0.2816327 0.06285274
## [7] {frozen meals} => {sausage} 0.01590559 0.2530612 0.06285274
## [8] {frozen meals} => {bottled water} 0.01539251 0.2448980 0.06285274
## [9] {frozen meals} => {root vegetables} 0.01462288 0.2326531 0.06285274
## [10] {frozen meals} => {citrus fruit} 0.01436634 0.2285714 0.06285274
## [11] {frozen meals} => {canned beer} 0.01205747 0.1918367 0.06285274
## [12] {frozen meals} => {pip fruit} 0.01180092 0.1877551 0.06285274
## [13] {frozen meals} => {pastry} 0.01180092 0.1877551 0.06285274
## [14] {frozen meals} => {bottled beer} 0.01154438 0.1836735 0.06285274
## [15] {frozen meals} => {whipped/sour cream} 0.01128784 0.1795918 0.06285274
## [16] {frozen meals} => {brown bread} 0.01103130 0.1755102 0.06285274
## [17] {frozen meals} => {shopping bags} 0.01103130 0.1755102 0.06285274
## [18] {frozen meals} => {fruit/vegetable juice} 0.01077476 0.1714286 0.06285274
## [19] {frozen meals} => {newspapers} 0.01051821 0.1673469 0.06285274
## [20] {frozen meals} => {domestic eggs} 0.01026167 0.1632653 0.06285274
## lift count
## [1] 1.131353 127
## [2] 1.284022 110
## [3] 1.170505 108
## [4] 1.168383 81
## [5] 1.054604 81
## [6] 1.205054 69
## [7] 1.228434 62
## [8] 1.145993 60
## [9] 1.008767 57
## [10] 1.232326 56
## [11] 1.161148 47
## [12] 1.100555 46
## [13] 1.057615 46
## [14] 1.156638 45
## [15] 1.160944 44
## [16] 1.290828 43
## [17] 1.042894 43
## [18] 1.372133 42
## [19] 1.196914 41
## [20] 1.226220 40
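Another way to read the lift column is as the ratio of observed to expected co-occurrences, where "expected" assumes the two items are bought independently. The sketch below does this for {frozen meals} => {whole milk}, using the coverage and count from the table above and the whole milk count from `summary(trans)`. The variable names are illustrative.

```r
n        <- 3898
n_milk   <- 1786                      # baskets with whole milk
n_frozen <- round(0.06285274 * n)     # coverage * n: baskets with frozen meals
observed <- 127                       # baskets with both (count column)

# Baskets with both items we would expect if purchases were independent.
expected <- n_frozen * n_milk / n

# Lift as observed / expected.
lift <- observed / expected

n_frozen
round(expected, 1)
round(lift, 6)
```

Frozen meal buyers have whole milk in about 15 more baskets than independence would predict (127 against roughly 112), which is the modest effect that a lift of 1.13 expresses.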
It seems that frozen meal buyers purchase ordinary items as well, although some of their transactions also include beer. Having answered a few questions and gained some insight, we can conclude this work.
The analysis of the grocery dataset through association rules provided valuable insights into the purchasing patterns of customers. The application of the association rule mining technique allowed us to answer interesting questions and uncover relationships between different items. Additionally, the results were effectively visualized through various plots, making it easier to understand and interpret the findings. However, the dataset has some limitations, such as its specificity in terms of item groups, with some of them being very similar, making it challenging to properly analyze the situation. Despite this, the analysis serves as a foundation for future studies and provides valuable information for the optimization of retail operations.