This paper analyses association rules in a grocery basket dataset found on Kaggle. Association rules are used to find dependencies between items in a set of transactions, e.g. “if a person buys bread, they are likely, to a certain extent, to also buy butter and milk”. Finding association rules and putting them to use in sales departments can be an important factor in increasing revenue. Shu-hsien Liao and Hsiao-ko Chang found that recommendation systems based on association rules can be beneficial, increasing both the number of customers and the quantities of products they buy. The rules in this paper are found with the help of the packages listed below.
library(arules)
library(arulesViz)
library(psych)
library(stringr)
library(kableExtra)
library(plotly)
With 169 unique products, we would like to find rules for only one or two of them, because analysing rules for every product would take too much time and would probably strip this paper of any concrete insight. My product of choice is beer. The rules could show, for example, whether people tend to buy beer on its own, together with snacks for parties, or whether they treat it like water and juice and buy it with regular groceries.
The dataset contains 9835 transactions by customers shopping for groceries, covering 169 unique items. The original file had a first column giving the number of products bought; I deleted it because it would otherwise be treated as an item in the basket. Below you can see a preview of the dataset. Showing all the transactions with the inspect function would be impractical, since there are almost 10 thousand of them.
groceries <-
  read.transactions(
    "groceries.csv",
    format = "basket",
    sep = ",",
    skip = 0,
    header = TRUE
  )
groceries
## transactions in sparse format with
## 9835 transactions (rows) and
## 169 items (columns)
LIST(head(groceries))
## [[1]]
## [1] "citrus fruit" "margarine" "ready soups"
## [4] "semi-finished bread"
##
## [[2]]
## [1] "coffee" "tropical fruit" "yogurt"
##
## [[3]]
## [1] "whole milk"
##
## [[4]]
## [1] "cream cheese" "meat spreads" "pip fruit" "yogurt"
##
## [[5]]
## [1] "condensed milk" "long life bakery product"
## [3] "other vegetables" "whole milk"
##
## [[6]]
## [1] "abrasive cleaner" "butter" "rice"
## [4] "whole milk" "yogurt"
Since there are too many transactions to get a good overview by listing them, we can use descriptive statistics, which summarise the distribution of the number of items per transaction.
describe(size(groceries))
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 9835 4.41 3.59 3 3.84 2.97 1 32 31 1.64 3.8
## se
## X1 0.04
hist(size(groceries), breaks = 16)
We can see that, on average, people bought between 4 and 5 products per transaction (mean 4.41). However, the mean is sensitive to outliers, so the median of 3 is a better indication of the “typical” basket size. The maximum number of products bought is 32, and the histogram shows that the largest share of transactions falls in the first bin, i.e. the smallest baskets. The breaks parameter sets how many bins the histogram is split into. With breaks = 16 and sizes ranging from 1 to 32, each bin covers two basket sizes: the first bin holds transactions of 1 or 2 products, the second those of 3 or 4 products, and so on.
We can also inspect how frequently individual products occur. Again, I will show only the 30 most frequent items, since plotting all 169 would be unreadable.
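One caveat about the binning: when `breaks` is a single number, base R's `hist()` treats it only as a suggestion and chooses “pretty” bin edges itself. The sketch below uses a small hypothetical vector of basket sizes (not the real data) with the same 1 to 32 range, and prints the edges that were actually chosen next to exact per-size counts from `table()`.

```r
# Hypothetical basket sizes spanning the same range (1 to 32) as the data.
sizes <- c(1, 1, 1, 2, 3, 3, 4, 5, 8, 12, 32)

# plot = FALSE returns the bin edges and counts without drawing anything.
h <- hist(sizes, breaks = 16, plot = FALSE)
print(h$breaks)  # bin edges actually chosen by hist()
print(h$counts)  # number of values falling into each bin

# Exact counts per basket size, with no binning involved.
print(table(sizes))
```

On the real data the same check would be `hist(size(groceries), breaks = 16, plot = FALSE)$breaks`, and `table(size(groceries))` would give the exact number of one-item transactions.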
itemFrequencyPlot(
  groceries,
  topN = 30,
  type = "relative",
  main = "Item frequency",
  cex.names = 0.85
)
We can observe that whole milk is the most frequent choice for customers. The next four items also stand out a bit from the rest, among which there are no significant drops in frequency. The product of my choice appears on the plot in two variants: bottled beer and canned beer. To be sure there aren’t any other variants, I will look for all item labels containing the word “beer”.
uniques <- groceries@itemInfo[["labels"]]
uniques[str_detect(uniques, "beer")]
## [1] "bottled beer" "canned beer"
As we can see, exactly two items describe beer, so we can be sure that no beer-related item will be omitted.
First we need to generate rules from our dataset using the Apriori algorithm. The quality of each rule can be assessed with three measures: support, confidence and lift. Each measure is discussed in more detail in the following subsections. The apriori function has default minimum values for support and confidence (0.1 and 0.8, respectively), which are too high for our dataset: with them no rules are found, so we should lower the thresholds for minimum support and confidence.
rules <- apriori(groceries, parameter = list(supp = 0.01, conf = 0.45))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.45 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 98
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [31 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
A total of 31 rules were found, which is a workable number for further analysis, although the thresholds had to be set considerably lower than the defaults.
Support is a measure which tells us how often a certain set of items appears in the whole transaction set. In other words, it is the probability of that set appearing in a transaction. Of the 31 rules obtained, the ones below are the top 6 by support.
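To make the three measures concrete before looking at each one, here is a minimal base R sketch that computes them by hand for a single rule. The baskets and the helper `support()` are hypothetical and exist only for illustration; they are not part of arules and not taken from the groceries data.

```r
# Toy baskets (hypothetical data, not taken from the groceries set).
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk"),
  c("beer", "chips")
)

# Share of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Measures for the rule {bread} => {butter}.
lhs <- "bread"
rhs <- "butter"
supp_rule <- support(c(lhs, rhs), baskets)      # P(lhs and rhs)
conf_rule <- supp_rule / support(lhs, baskets)  # P(rhs | lhs)
lift_rule <- conf_rule / support(rhs, baskets)  # P(rhs | lhs) / P(rhs)

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

Here bread and butter appear together in 2 of 5 baskets (support 0.4), butter is in 2 of the 3 bread baskets (confidence 2/3), and that is 5/3 times the overall share of butter baskets (lift).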
rules_supp <- sort(rules, by = "support", decreasing = TRUE)
rules_supp_dt <- inspect(head(rules_supp), linebreak = FALSE)
kable(rules_supp_dt, "html") %>% kable_styling("striped")
| | lhs | | rhs | support | confidence | lift | count |
---|---|---|---|---|---|---|---|
[1] | {domestic eggs} | => | {whole milk} | 0.0299949 | 0.4727564 | 1.850203 | 295 |
[2] | {butter} | => | {whole milk} | 0.0275547 | 0.4972477 | 1.946053 | 271 |
[3] | {curd} | => | {whole milk} | 0.0261312 | 0.4904580 | 1.919480 | 257 |
[4] | {other vegetables,root vegetables} | => | {whole milk} | 0.0231825 | 0.4892704 | 1.914833 | 228 |
[5] | {root vegetables,whole milk} | => | {other vegetables} | 0.0231825 | 0.4740125 | 2.449770 | 228 |
[6] | {other vegetables,yogurt} | => | {whole milk} | 0.0222674 | 0.5128806 | 2.007235 | 219 |
We can see that the rule with the highest support (almost 3%) is the one where a person buys domestic eggs and whole milk. It means that, of the 9835 transactions, 295 contained both domestic eggs and whole milk. All of the rules with the highest support contain whole milk. All in all, the most common combinations of products in our dataset (given the parameters of the Apriori algorithm) contain dairy products or vegetables.
Confidence describes how likely item B (rhs) is to be in a transaction given that item A (lhs) is already in it. Its maximum value is 1, reached when customers always buy item B together with item A.
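Written as a formula, with T the set of all transactions and A and B the itemsets on the left- and right-hand side of a rule (the symbols are introduced here for illustration), support and the check for the top rule in the table look like this:

```latex
\mathrm{supp}(A \Rightarrow B)
  = \frac{\lvert \{\, t \in T : A \cup B \subseteq t \,\} \rvert}{\lvert T \rvert},
\qquad
\mathrm{supp}(\{\text{domestic eggs}\} \Rightarrow \{\text{whole milk}\})
  = \frac{295}{9835} \approx 0.0300
```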
rules_conf <- sort(rules, by = "confidence", decreasing = TRUE)
rules_conf_dt <- inspect(head(rules_conf), linebreak = FALSE)
kable(rules_conf_dt, "html") %>% kable_styling("striped")
| | lhs | | rhs | support | confidence | lift | count |
---|---|---|---|---|---|---|---|
[1] | {citrus fruit,root vegetables} | => | {other vegetables} | 0.0103711 | 0.5862069 | 3.029608 | 102 |
[2] | {root vegetables,tropical fruit} | => | {other vegetables} | 0.0123030 | 0.5845411 | 3.020999 | 121 |
[3] | {curd,yogurt} | => | {whole milk} | 0.0100661 | 0.5823529 | 2.279125 | 99 |
[4] | {butter,other vegetables} | => | {whole milk} | 0.0114896 | 0.5736041 | 2.244885 | 113 |
[5] | {root vegetables,tropical fruit} | => | {whole milk} | 0.0119980 | 0.5700483 | 2.230969 | 118 |
[6] | {root vegetables,yogurt} | => | {whole milk} | 0.0145399 | 0.5629921 | 2.203354 | 143 |
From the results we can gather that if a person bought citrus fruit and root vegetables, they also bought other vegetables in roughly 59% of cases. Overall, the confidence levels are not very high, which may be because there are many unique products and hence many possible combinations. Again, most of the top-confidence rules have whole milk on the right-hand side, which suggests that milk is bought largely regardless of what else is in the basket.
Lift can be seen as a measure of correlation of sorts. It tells us how much more likely items A and B are to be bought together than they would be if they were unrelated. Lift < 1 means the products appear together less often than expected under independence, lift > 1 means they appear together more often than expected, and lift = 1 means there is no difference.
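In formula form, confidence is the support of the whole rule divided by the support of its left-hand side. The worked line below uses the first row of the table; the left-hand-side count of about 174 transactions is not printed in the output and is derived here from the reported support and confidence (0.0103711 / 0.5862069 ≈ 0.0177 of 9835 transactions).

```latex
\mathrm{conf}(A \Rightarrow B)
  = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)},
\qquad
\mathrm{conf}(\{\text{citrus fruit}, \text{root vegetables}\} \Rightarrow \{\text{other vegetables}\})
  \approx \frac{102}{174} \approx 0.586
```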
rules_lift <- sort(rules, by = "lift", decreasing = TRUE)
rules_lift_dt <- inspect(head(rules_lift), linebreak = FALSE)
kable(rules_lift_dt, "html") %>% kable_styling("striped")
| | lhs | | rhs | support | confidence | lift | count |
---|---|---|---|---|---|---|---|
[1] | {citrus fruit,root vegetables} | => | {other vegetables} | 0.0103711 | 0.5862069 | 3.029608 | 102 |
[2] | {root vegetables,tropical fruit} | => | {other vegetables} | 0.0123030 | 0.5845411 | 3.020999 | 121 |
[3] | {rolls/buns,root vegetables} | => | {other vegetables} | 0.0122013 | 0.5020921 | 2.594890 | 120 |
[4] | {root vegetables,yogurt} | => | {other vegetables} | 0.0129131 | 0.5000000 | 2.584078 | 127 |
[5] | {whipped/sour cream,yogurt} | => | {other vegetables} | 0.0101678 | 0.4901961 | 2.533410 | 100 |
[6] | {root vegetables,whole milk} | => | {other vegetables} | 0.0231825 | 0.4740125 | 2.449770 | 228 |
We can see that the highest lift is a little above 3. It implies that other vegetables appeared about three times more often in transactions with citrus fruit and root vegetables than in transactions overall. All of the rules with the highest lift have other vegetables as rhs. What we can infer is that “other vegetables” are more likely to be bought together with the lhs products than they would be if the purchases were independent. It seems that people don’t go shopping to buy vegetables only.
Now that we have a general overview of the rules in our set, we can look for the ones of interest (bottled beer and canned beer). We have to lower both the support and confidence thresholds, because far fewer transactions contain beer. The goal was to obtain approximately 5 rules for each type of packaging. In the end, with the same confidence threshold (0.175) and different support thresholds, we obtained 6 rules for each.
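As a formula, lift compares the confidence of a rule with the overall support of its right-hand side. The overall support of other vegetables is not shown in the table; the value of about 0.1935 below is derived from the first row (0.5862069 / 3.029608), so treat it as an approximation.

```latex
\mathrm{lift}(A \Rightarrow B)
  = \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)}
  = \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)\,\mathrm{supp}(B)},
\qquad
\frac{0.5862}{0.1935} \approx 3.03
```

In words: roughly 19% of all baskets contain other vegetables, against roughly 59% of the baskets that already contain citrus fruit and root vegetables.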
rules_bbeer <-
  apriori(
    data = groceries,
    parameter = list(supp = 0.002, conf = 0.175),
    appearance = list(default = "lhs", rhs = "bottled beer"),
    control = list(verbose = FALSE)
  )
rules_bbeer_dt <- inspect(rules_bbeer, linebreak = FALSE)
kable(rules_bbeer_dt, "html") %>% kable_styling("striped")
| | lhs | | rhs | support | confidence | lift | count |
---|---|---|---|---|---|---|---|
[1] | {liquor} | => | {bottled beer} | 0.0046772 | 0.4220183 | 5.240594 | 46 |
[2] | {red/blush wine} | => | {bottled beer} | 0.0048805 | 0.2539683 | 3.153760 | 48 |
[3] | {bottled water,fruit/vegetable juice} | => | {bottled beer} | 0.0025419 | 0.1785714 | 2.217487 | 25 |
[4] | {bottled water,soda} | => | {bottled beer} | 0.0050839 | 0.1754386 | 2.178584 | 50 |
[5] | {bottled water,whole milk} | => | {bottled beer} | 0.0061007 | 0.1775148 | 2.204366 | 60 |
[6] | {bottled water,other vegetables,whole milk} | => | {bottled beer} | 0.0024403 | 0.2264151 | 2.811607 | 24 |
We can see that the rule with the highest confidence and a really high lift is the one connecting bottled beer to liquor. A lift of about 5 means that liquor and bottled beer are bought together roughly 5 times more often than would be expected if they were independent. This is fairly self-explanatory: most people don’t drink much alcohol on a daily basis, so such purchases are probably made for parties or other gatherings. The two highest support values (rules no. 4 and no. 5) show that beer is also frequently bought with other drinks, such as water, soda or milk. The only food appearing in these rules is vegetables, which is quite peculiar. One would expect someone to buy beer along with, for example, meat for grilling.
Below is a plot mapping the rules onto a 2D plane, where we can see more clearly that rule no. 1 stands out from the rest.
plot(rules_bbeer, engine = "plotly")
plot(rules_bbeer, method="grouped")
rules_cbeer <-
  apriori(
    data = groceries,
    parameter = list(supp = 0.001, conf = 0.175),
    appearance = list(default = "lhs", rhs = "canned beer"),
    control = list(verbose = FALSE)
  )
rules_cbeer_dt <- inspect(rules_cbeer, linebreak = FALSE)
kable(rules_cbeer_dt, "html") %>% kable_styling("striped")
| | lhs | | rhs | support | confidence | lift | count |
---|---|---|---|---|---|---|---|
[1] | {liquor (appetizer)} | => | {canned beer} | 0.0017285 | 0.2179487 | 2.805662 | 17 |
[2] | {chicken,soda} | => | {canned beer} | 0.0015252 | 0.1829268 | 2.354824 | 15 |
[3] | {coffee,soda} | => | {canned beer} | 0.0019319 | 0.1938776 | 2.495793 | 19 |
[4] | {sausage,shopping bags,soda} | => | {canned beer} | 0.0010168 | 0.1785714 | 2.298757 | 10 |
[5] | {rolls/buns,sausage,shopping bags} | => | {canned beer} | 0.0014235 | 0.2372881 | 3.054619 | 14 |
[6] | {rolls/buns,shopping bags,soda} | => | {canned beer} | 0.0015252 | 0.2419355 | 3.114444 | 15 |
As we can see, canned beer appears to be somewhat less popular among customers: the support levels of its rules are lower. Like bottled beer, it is bought with liquor often enough to have its own rule, but this rule does not lead in any of the measures this time. The highest support goes to rule no. 3, which also contains coffee and soda, so perhaps canned beer is treated more as an “everyday drink” than bottled beer. This may also be connected to the fact that 4 out of 6 rules contain food, compared to 1 for bottled beer. Food is not only more common in these rules but also more strongly associated with beer. Sausages, buns, soda and canned beer form a pretty popular combination for homemade barbecues or bonfires.
On the plot below we can see that the rules combining rolls/buns with sausage, shopping bags or soda have the highest confidence and lift, with slightly smaller values for the liquor rule.
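Because the beer rules were mined with much lower support thresholds than the general rules, each of them rests on only a few transactions. The short sketch below converts the three relative thresholds used in this paper into approximate transaction counts; the variable names are made up for illustration.

```r
# Number of transactions in the dataset, as reported by read.transactions().
n_transactions <- 9835

# Minimum support thresholds used in the three apriori() runs.
min_support <- c(all_rules = 0.01, bottled_beer = 0.002, canned_beer = 0.001)

# Relative support times the number of transactions gives the
# (fractional) number of baskets a rule must roughly appear in.
min_count <- min_support * n_transactions
print(min_count)
```

The smallest canned beer rule in the table (no. 4) is backed by only 10 transactions, so the canned beer rules in particular should be read as tentative patterns.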
plot(rules_cbeer, engine = "plotly")
plot(rules_cbeer, method="grouped")
Generally speaking, the association rules for bottled and canned beer proved plausible. Bottled beer was bought mostly with liquor or other drinks, while canned beer was associated more with food. This leads me to believe that there might be a distinction between them, i.e. bottled beer being a “party beer” and canned beer being more of a “barbecue/bonfire beer”.
Whatever the distinctions, shopkeepers could benefit from such knowledge. They could bundle products according to the generated rules, or give a discount on product A when product B is already in the shopping cart. They could even do something as simple and low-cost as rearranging the shelves so that beer is close to liquor (it’s like this already, well done shopkeepers) or moving the barbecue equipment and necessities toward the beer shelves. Such subtle nudges would probably increase the number of products sold by winning over “lazy” customers who don’t want to walk to another aisle for something they only just remembered while looking at a particular shelf.