This paper uses a Kaggle dataset (link) containing 36765 transactions to conduct a hypothetical study on grocery store strategy. We search for bundles of 2 or more products that could be put on sale together. This common practice, called cross-selling, is used by stores to boost sales of less desired products. As we do not possess the exact sales data of the products contained in the set, we will choose a few products from the less prevalent segment. To achieve the desired output, market basket analysis will be conducted using the Apriori algorithm.
library(tidyverse)
library(arules)
library(arulesViz)
library(knitr)
data <- read.csv('Groceries_dataset.csv')
head(data)
## Member_number Date itemDescription
## 1 1808 21-07-2015 tropical fruit
## 2 2552 05-01-2015 whole milk
## 3 2300 19-09-2015 pip fruit
## 4 1187 12-12-2015 other vegetables
## 5 3037 01-02-2015 whole milk
## 6 4941 14-02-2015 rolls/buns
As we can see, the data needs further transformation before it can be analysed. The dataset has 3 columns: the first contains the ID of the customer, the second the date of the transaction and the third a single product. The first step is to process the data so that each row holds a vector with the whole daily market basket of a customer. For this purpose, tidyverse will be useful. Along the way, unwanted, excess or inconclusive data will be removed from the dataset (e.g. ‘other vegetables’, which tells us nothing). Additionally, dates and customer IDs will be removed and replaced by integers marking the number of the transaction. Finally, we will make sure that no empty vectors are present in the column.
data <- subset(data, !grepl("other vegetables", itemDescription, ignore.case = TRUE))
data <- subset(data, !grepl("shopping bags", itemDescription, ignore.case = TRUE))
data <- subset(data, !grepl("newspapers", itemDescription, ignore.case = TRUE))
data_proc <- data %>%
group_by(Member_number, Date) %>%
summarise(items = list(itemDescription))
## `summarise()` has grouped output by 'Member_number'. You can override using the
## `.groups` argument.
data_proc$transaction_id <- seq_along(data_proc$Date)
data_proc <- data_proc[, !(names(data_proc) %in% c("Date", "Member_number"))]
data_proc <- data_proc[sapply(data_proc$items, function(vec) length(vec) > 0), ]
head(data_proc)
## # A tibble: 6 × 2
## items transaction_id
## <list> <int>
## 1 <chr [4]> 1
## 2 <chr [3]> 2
## 3 <chr [2]> 3
## 4 <chr [2]> 4
## 5 <chr [2]> 5
## 6 <chr [2]> 6
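The grouping step above can be illustrated on a tiny example. The following is a minimal base R sketch, assuming a hypothetical data frame `toy` with the same three columns as the dataset; it is not the pipeline used in this paper, only a way to see what "one basket per customer and day" means.

```r
# Hypothetical rows mirroring the columns of Groceries_dataset.csv
toy <- data.frame(
  Member_number = c(1808, 1808, 2552, 2552, 2552),
  Date = c("21-07-2015", "21-07-2015", "05-01-2015", "05-01-2015", "06-01-2015"),
  itemDescription = c("tropical fruit", "rolls/buns", "whole milk", "yogurt", "soda"),
  stringsAsFactors = FALSE
)

# One character vector per (customer, date) pair; drop = TRUE discards
# combinations of customer and date that never occur in the data
baskets <- split(toy$itemDescription,
                 list(toy$Member_number, toy$Date),
                 drop = TRUE)
baskets <- unname(baskets)

# Number of items in each basket
lengths(baskets)
```

Customer 2552 shopped on two different days, so that customer contributes two separate baskets.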
With the data processed, we can now convert it into a new variable of class transactions.
trans <- as(data_proc$items, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
summary(trans)
## transactions as itemMatrix in sparse format with
## 14893 rows (elements/itemsets/transactions) and
## 164 columns (items) and a density of 0.01428278
##
## most frequent items:
## whole milk rolls/buns soda yogurt root vegetables
## 2363 1646 1453 1285 1041
## (Other)
## 27097
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9
## 1839 8973 2415 1056 283 157 94 68 8
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.342 3.000 9.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
itemFrequencyPlot(trans,topN=20,type="absolute")
From the summary of the transactions object, we can see that the most frequently bought item was ‘whole milk’, which seems logical. More interesting to us, however, will be the less popular items.
In order to construct the rules, we use the Apriori algorithm. Note that low thresholds for support and confidence have to be used, as the frequent occurrence of ‘whole milk’ seems to distort the dataset. Additionally, the minlen parameter sets the minimum rule length to 2 items, counting both sides of the rule, so rules with an empty left-hand side are excluded.
rules <- apriori(trans, parameter = list(support = 0.001, confidence = 0.1, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[164 item(s), 14893 transaction(s)] done [0.00s].
## sorting and recoding items ... [146 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [95 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
temp <- kable(inspect(head(rules)))
temp
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {frozen fish} | => | {whole milk} | 0.0010743 | 0.1568627 | 0.0068489 | 0.9886402 | 16 |
| [2] | {seasonal products} | => | {rolls/buns} | 0.0010072 | 0.1415094 | 0.0071174 | 1.2803767 | 15 |
| [3] | {pot plants} | => | {whole milk} | 0.0010072 | 0.1282051 | 0.0078560 | 0.8080233 | 15 |
| [4] | {pasta} | => | {whole milk} | 0.0010743 | 0.1322314 | 0.0081246 | 0.8333992 | 16 |
| [5] | {pickled vegetables} | => | {whole milk} | 0.0010072 | 0.1119403 | 0.0089975 | 0.7055129 | 15 |
| [6] | {packaged fruit/vegetables} | => | {rolls/buns} | 0.0012086 | 0.1417323 | 0.0085275 | 1.2823930 | 18 |
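The line `Absolute minimum support count` in the log can be related to the relative threshold with a little arithmetic. The sketch below assumes that the absolute count is the relative support multiplied by the number of transactions and truncated to an integer. This assumption matches the values printed in both logs of this paper, but it is inferred from the output rather than taken from the arules documentation.

```r
# Relative support threshold passed to apriori() in this paper
min_support <- 0.001

# Transaction counts reported by summary(trans) for the two runs
n_first <- 14893
n_second <- 11973

# Assumed relation: truncate (relative support * number of transactions)
abs_first <- floor(min_support * n_first)
abs_second <- floor(min_support * n_second)

abs_first   # compare with "Absolute minimum support count: 14"
abs_second  # compare with "Absolute minimum support count: 11"
```

In practice this means that an item pair has to appear in only about 15 baskets to qualify as a rule, which is worth keeping in mind when interpreting the results.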
We get 95 rules, with most of them containing ‘whole milk’. This suggests that whole milk is bought regardless of what else is in the basket, so it is not of interest to us. The data will need some more cleaning. All rows with ‘whole milk’ will be removed from the original dataset. Additionally, all vectors with a length below 2 will be removed before the conversion to transactions. The logic behind this is as follows: we want to see which products we can bundle together. Therefore, we are not interested in products which are bought either way, nor in products bought on their own in a single transaction.
data <- subset(data, !grepl("whole milk", itemDescription, ignore.case = TRUE))
data_proc <- data %>%
group_by(Member_number, Date) %>%
summarise(items = list(itemDescription))
## `summarise()` has grouped output by 'Member_number'. You can override using the
## `.groups` argument.
data_proc$transaction_id <- seq_along(data_proc$Date)
data_proc <- data_proc[, !(names(data_proc) %in% c("Date", "Member_number"))]
data_proc <- data_proc[sapply(data_proc$items, function(vec) length(vec) >= 2), ]
trans <- as(data_proc$items, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
summary(trans)
## transactions as itemMatrix in sparse format with
## 11973 rows (elements/itemsets/transactions) and
## 163 columns (items) and a density of 0.01525775
##
## most frequent items:
## rolls/buns soda yogurt root vegetables tropical fruit
## 1498 1322 1177 953 926
## (Other)
## 23901
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9
## 151 8191 2247 857 263 137 82 41 4
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.487 3.000 9.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
itemFrequencyPlot(trans,topN=20,type="absolute")
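The filter on `length(vec) >= 2` counts repeated items separately, while the conversion to transactions keeps each item only once (hence the warning about duplicated items). A basket such as two sodas therefore passes the filter but ends up with size 1. The following is a minimal sketch of filtering on distinct items instead, using a hypothetical list `items` in place of `data_proc$items`; applying it to the real data would change the counts reported in this paper.

```r
# Hypothetical baskets standing in for data_proc$items
items <- list(
  c("soda", "soda"),
  c("rolls/buns", "yogurt"),
  c("sausage"),
  c("pasta", "yogurt", "pasta")
)

# Filter as used in the paper: repeated items are counted separately
keep_raw <- sapply(items, function(vec) length(vec) >= 2)

# Alternative: count distinct items, which is what remains after coercion
keep_distinct <- sapply(items, function(vec) length(unique(vec)) >= 2)

sum(keep_raw)       # baskets kept by the original filter
sum(keep_distinct)  # baskets kept when duplicates are ignored
```

Here the basket with two sodas is kept by the first filter and dropped by the second.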
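Now with ‘whole milk’ and vectors shorter than 2 removed (the 151 single-item transactions left in the summary come from baskets whose duplicated items were collapsed during the conversion), we can proceed to use the Apriori algorithm once again.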
rules <- apriori(trans, parameter = list(support = 0.001, confidence = 0.1, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.1 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 11
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[163 item(s), 11973 transaction(s)] done [0.00s].
## sorting and recoding items ... [146 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [37 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
temp <- kable(inspect(head(rules)))
temp
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {seasonal products} | => | {rolls/buns} | 0.0012528 | 0.1515152 | 0.0082686 | 1.2110086 | 15 |
| [2] | {canned fish} | => | {rolls/buns} | 0.0011693 | 0.1320755 | 0.0088533 | 1.0556339 | 14 |
| [3] | {pot plants} | => | {yogurt} | 0.0010858 | 0.1203704 | 0.0090203 | 1.2244643 | 13 |
| [4] | {pasta} | => | {yogurt} | 0.0011693 | 0.1250000 | 0.0093544 | 1.2715590 | 14 |
| [5] | {pasta} | => | {soda} | 0.0010023 | 0.1071429 | 0.0093544 | 0.9703642 | 12 |
| [6] | {packaged fruit/vegetables} | => | {rolls/buns} | 0.0015034 | 0.1512605 | 0.0099390 | 1.2089733 | 18 |
We can see that the rules are now much more diverse, giving us insight into more nuanced customer preferences. What we can do now is view the rules sorted by their support, confidence and lift. The coverage parameter will not be discussed in the following part, as it is simply the support of the LHS: the share of all transactions that contain the LHS item.
The first parameter we inspect is support. It is defined as the number of transactions containing the items on both sides of the rule, divided by the total number of transactions.
Confidence, on the other hand, measures the strength of the association between two items: how often the RHS item appears in transactions that contain the LHS item. It is calculated as the joint support of the 2 items, divided by the support of the left-hand side item.
temp <- kable(inspect(head(sort(rules, by = "support", decreasing = TRUE))))
temp1 <- kable(inspect(head(sort(rules, by = "confidence", decreasing = TRUE))))
temp
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {sausage} | => | {soda} | 0.0074334 | 0.1082725 | 0.0686545 | 0.9805951 | 89 |
| [2] | {sausage} | => | {yogurt} | 0.0071828 | 0.1046229 | 0.0686545 | 1.0642733 | 86 |
| [3] | {pip fruit} | => | {rolls/buns} | 0.0061806 | 0.1106129 | 0.0558757 | 0.8840906 | 74 |
| [4] | {fruit/vegetable juice} | => | {rolls/buns} | 0.0046772 | 0.1174004 | 0.0398396 | 0.9383413 | 56 |
| [5] | {frankfurter} | => | {rolls/buns} | 0.0045937 | 0.1086957 | 0.0422618 | 0.8687671 | 55 |
| [6] | {pork} | => | {rolls/buns} | 0.0042596 | 0.1013917 | 0.0420112 | 0.8103887 | 51 |
temp1
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {processed cheese} | => | {rolls/buns} | 0.0018375 | 0.1538462 | 0.0119435 | 1.229639 | 22 |
| [2] | {seasonal products} | => | {rolls/buns} | 0.0012528 | 0.1515152 | 0.0082686 | 1.211009 | 15 |
| [3] | {packaged fruit/vegetables} | => | {rolls/buns} | 0.0015034 | 0.1512605 | 0.0099390 | 1.208973 | 18 |
| [4] | {red/blush wine} | => | {rolls/buns} | 0.0016704 | 0.1388889 | 0.0120271 | 1.110091 | 20 |
| [5] | {detergent} | => | {yogurt} | 0.0013363 | 0.1344538 | 0.0099390 | 1.367727 | 16 |
| [6] | {soft cheese} | => | {yogurt} | 0.0015869 | 0.1338028 | 0.0118600 | 1.361106 | 19 |
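The definitions of support and confidence can be checked by hand against the first row of the support table, the rule {sausage} => {soda}. The sketch below uses only figures printed in this paper (the count of 89, the 11973 transactions and the coverage of 0.0686545); the variable names are illustrative.

```r
# Figures reported for {sausage} => {soda}
n_trans <- 11973           # transactions in the second run
count_rule <- 89           # baskets containing both sausage and soda
coverage_lhs <- 0.0686545  # support of {sausage}, from the coverage column

# Support: share of all baskets containing both items
support <- count_rule / n_trans

# Confidence: joint support divided by the support of the LHS
confidence <- support / coverage_lhs

round(support, 7)     # compare with the support column
round(confidence, 5)  # compare with the confidence column
```

Up to rounding, both values agree with the table: roughly 0.7% of all baskets contain both items, and roughly 11% of the baskets with sausage also contain soda.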
We can see that the rule with the highest support is sausage -> soda. If we search for an empirical explanation, we can look to the grilling season: sausages and sodas seem to be the perfect combination for a nice barbecue (author's assumption). Although it is the most frequent rule in this dataset, its lift of about 0.98 indicates no positive association between the purchases of those items. That being said, values for support are low for every rule in the dataset.
The results of inspecting the rules with the highest confidence are not as informative. A similar situation takes place, where an everyday product (in this case ‘rolls/buns’) dominates the results. Note how the rules with rolls/buns combine the highest confidence (about 0.14 to 0.15) with a lift above 1. This might indicate that rolls/buns are rarely bought by themselves, usually being picked up along the way. The same can be said for the items on the left-hand side of the rules. Those are most probably everyday household groceries.
Lift tells us how much more likely item B is to be bought when item A is in the basket, compared with how often B is bought overall. It is calculated as the confidence of the rule divided by the support of item B; values above 1 indicate a positive association and values below 1 a negative one. For the assumed purpose of this paper, this will be the most significant measure, as it can tell us how to boost sales of less desired items by bundling them with other items.
temp <- kable(inspect(head(sort(rules, by = "lift", decreasing = TRUE))))
temp
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {beverages} | => | {sausage} | 0.0019210 | 0.1017699 | 0.0188758 | 1.482349 | 23 |
| [2] | {flour} | => | {tropical fruit} | 0.0013363 | 0.1142857 | 0.0116930 | 1.477692 | 16 |
| [3] | {processed cheese} | => | {root vegetables} | 0.0013363 | 0.1118881 | 0.0119435 | 1.405704 | 16 |
| [4] | {detergent} | => | {yogurt} | 0.0013363 | 0.1344538 | 0.0099390 | 1.367727 | 16 |
| [5] | {soft cheese} | => | {yogurt} | 0.0015869 | 0.1338028 | 0.0118600 | 1.361106 | 19 |
| [6] | {pasta} | => | {yogurt} | 0.0011693 | 0.1250000 | 0.0093544 | 1.271559 | 14 |
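The lift values can be reproduced from figures already printed in this paper. The sketch below compares the rule with the highest lift, {beverages} => {sausage}, with the rule with the highest support, {sausage} => {soda}. The support of sausage is taken from the coverage column of the sausage rules, and the support of soda from its item count of 1322 in the summary; the variable names are illustrative.

```r
n_trans <- 11973  # transactions in the second run

# {beverages} => {sausage}
conf_bev_sausage <- 0.1017699  # confidence from the lift table
supp_sausage <- 0.0686545      # support of sausage (coverage of sausage rules)
lift_bev_sausage <- conf_bev_sausage / supp_sausage

# {sausage} => {soda}
conf_sausage_soda <- 0.1082725  # confidence from the support table
supp_soda <- 1322 / n_trans     # soda count from summary(trans)
lift_sausage_soda <- conf_sausage_soda / supp_soda

round(lift_bev_sausage, 4)   # above 1: positive association
round(lift_sausage_soda, 4)  # just below 1: no positive association
```

Read this way, sausage appears in about 10.2% of baskets that contain beverages, against about 6.9% of all baskets, which is where the lift of roughly 1.48 comes from.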
As we can see, sausage is more likely to be bought with beverages (which is very interesting when combined with the results of the Support section; we will touch on that in the Conclusions section). Tropical fruit is more likely to be bought with flour, which might indicate a nice bundle for bakers experimenting with new cakes containing tropical fruit. Root vegetables are more likely to be bought with processed cheese, which might be a sign of the community's breakfast and dinner needs, but those ingredients may also be used for a plethora of nice meals, one of them being a potato casserole (this is just the speculation of the author).
plot(rules,
     method = "graph",
     limit = 20)
Above is a visualization of the rules, along with the lift and support of particular items or item pairs. It is not very readable. A more effective way would be to use the ruleExplorer() function from arulesViz, which lets us explore the rules and zoom into the graph (sadly, it is not compatible with HTML markdown output). The graph reinforces the association of sausages and beverages. It also uncovers the unlikely but logical association of oil and soda. The story behind this pair might be the dinner preferences of the populace: frying the dinner and having a soda to go with it.
plot(rules,
     method = "scatterplot",
     limit = 20)
The scatter plot showing the relation of support, confidence and lift reveals that there are no rules with high support. This might indicate that the product bundles with high confidence are made up of products that are rarely bought in general. Therefore our focus should shift to lift, which will be the basis for our conclusions.
plot(rules,
     method = "matrix",
     limit = 20)
## Itemsets in Antecedent (LHS)
## [1] "{flour}" "{soft cheese}"
## [3] "{processed cheese}" "{beverages}"
## [5] "{pasta}" "{chewing gum}"
## [7] "{pot plants}" "{seasonal products}"
## [9] "{packaged fruit/vegetables}" "{detergent}"
## [11] "{oil}" "{herbs}"
## [13] "{red/blush wine}" "{sausage}"
## [15] "{canned fish}" "{ice cream}"
## [17] "{chocolate}"
## Itemsets in Consequent (RHS)
## [1] "{rolls/buns}" "{soda}" "{yogurt}"
## [4] "{root vegetables}" "{tropical fruit}" "{sausage}"
On the last graph, we can see the visualization of a matrix showing the lift of the rules. Rules with lower LHS indices tend to have a higher lift than rules with higher LHS indices. That being said, while the higher indices of the RHS have a higher lift, the lower indices more frequently have a lift of >1.
Despite the poor results of the Apriori algorithm on the data, the principal goal of the paper has been achieved. A sales bundle has been found: the pair of beverages and sausage. A representation of life imitating art, this combination seems to be the premier bundle sold in supermarkets around the spring break season. Additional bundles suggested by their higher lift might be root vegetables with processed cheese, or tropical fruit with flour.