Introduction

This paper focuses on analysing association rules in a grocery basket dataset found on Kaggle. Association rules are used to find dependencies throughout a set of transactions, e.g. “if a person buys bread, they are likely, to a certain extent, to buy butter and milk as well”. Finding association rules and putting them to use in sales departments is an important factor in increasing revenue. Shu-hsien Liao and Hsiao-ko Chang found that proper use of recommendation systems based on association rules is beneficial, as it increases both the number of customers and the quantities of products they buy. You can find the whole article here. The association rules in this paper are found with the help of the packages listed below.

library(arules)
library(arulesViz)
library(psych)
library(stringr)
library(kableExtra)
library(plotly)

With 169 unique products, we would like to find rules for only one or two of them, because analysing rules for each and every product would take too much time and would probably strip this paper of any concrete insight. So, the product for which I will be looking for association rules is beer. We could learn something interesting from this, for example whether people tend to buy only beer, or buy it with snacks for parties, or maybe treat it like water and juice and buy it with regular groceries.


Dataset

The dataset contains 9835 transactions by customers shopping for groceries and 169 unique items. Originally, the first column of the set held the number of products bought. I deleted it, since it was being treated as an item in the basket. Below you can see a preview of the dataset. Showing all the transactions via the inspect function would be impractical, since there are almost 10 thousand of them.
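
The removal of the count column can be sketched in base R as follows. This is only an illustration under assumptions: the sample lines and the name raw_lines are made up, and the layout (a leading item count followed by comma-separated items) is a guess at what the original file looked like.

```r
# Hypothetical sample lines: a leading item count, then the items themselves
raw_lines <- c(
  "4,citrus fruit,semi-finished bread,margarine,ready soups",
  "3,tropical fruit,yogurt,coffee",
  "1,whole milk"
)

# Drop everything up to and including the first comma (the count column)
cleaned <- sub("^[^,]*,", "", raw_lines)
print(cleaned)
```

The cleaned lines could then be written to a file with writeLines() and read as baskets.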

groceries <-
  read.transactions(
    "groceries.csv",
    format = "basket",
    sep = ",",
    skip = 0,
    header = TRUE
  )
groceries
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)
LIST(head(groceries))
## [[1]]
## [1] "citrus fruit"        "margarine"           "ready soups"        
## [4] "semi-finished bread"
## 
## [[2]]
## [1] "coffee"         "tropical fruit" "yogurt"        
## 
## [[3]]
## [1] "whole milk"
## 
## [[4]]
## [1] "cream cheese" "meat spreads" "pip fruit"    "yogurt"      
## 
## [[5]]
## [1] "condensed milk"           "long life bakery product"
## [3] "other vegetables"         "whole milk"              
## 
## [[6]]
## [1] "abrasive cleaner" "butter"           "rice"            
## [4] "whole milk"       "yogurt"

Since there are too many transactions to get a good overview by listing them, we can use descriptive statistics, which show the distribution of the number of items per transaction.

describe(size(groceries))
##    vars    n mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 9835 4.41 3.59      3    3.84 2.97   1  32    31 1.64      3.8
##      se
## X1 0.04
hist(size(groceries), breaks = 16)

We can see that, on average, people bought between 4 and 5 products per transaction. However, the mean is sensitive to outliers, so the median (3) seems a better measure of the “average” number of products. We can also see that the maximum number of products bought is 32, and from the histogram we can observe that the largest share of transactions falls into the first bin, i.e. the smallest baskets. The parameter breaks controls how many bins the histogram is split into. In this case breaks = 16 means that each bin “contains” two basket sizes: the first bin corresponds to the frequency of transactions with 1 or 2 products, the second bin to 3 or 4 products, and so on.
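
The binning can be checked on a small example. The sketch below uses a made-up vector of basket sizes (sizes is a hypothetical name, not the real data) that spans the same 1 to 32 range. When breaks is a single number, hist() treats it as a suggestion and picks round bin edges itself, so it is worth printing the edges it actually chose.

```r
# Hypothetical basket sizes spanning the same 1..32 range as the real data
sizes <- c(1, 1, 1, 2, 2, 3, 4, 4, 5, 32)

# plot = FALSE returns the bin edges and counts without drawing anything
h <- hist(sizes, breaks = 16, plot = FALSE)
print(h$breaks)  # bin edges chosen by R
print(h$counts)  # number of baskets per bin

# Exact counts per basket size, with no binning
print(table(sizes))
```

A table of sizes is the safer way to tell 1-product baskets apart from 2-product ones, since both land in the same histogram bin.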

We can also inspect how frequently particular products occur. Again, I will show only the top 30 items by relative frequency, since a plot of all 169 would be unreadable.

itemFrequencyPlot(
  groceries,
  topN = 30,
  type = "relative",
  main = "Item frequency",
  cex.names = 0.85
) 

We can observe that whole milk is the most frequent choice among customers. The next 4 positions also stand out a bit from the rest, where there are no significant drops in frequency between neighbouring products. The product of my choice appears on the plot in two variants: bottled beer and canned beer. To be sure there aren’t any other variants, I will look for unique product names containing the word “beer”.
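
What “relative” frequency means here can be shown in base R on a handful of made-up baskets (the list baskets is hypothetical and not taken from the dataset): it is the share of baskets that contain the item.

```r
# Hypothetical toy baskets (not taken from the dataset)
baskets <- list(
  c("whole milk", "bottled beer"),
  c("whole milk", "yogurt"),
  c("canned beer"),
  c("whole milk", "yogurt", "soda")
)

# Count the baskets containing each item, then divide by the number of baskets
item_counts <- table(unlist(lapply(baskets, unique)))
rel_freq <- sort(item_counts / length(baskets), decreasing = TRUE)
print(rel_freq)
```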

uniques <- groceries@itemInfo[["labels"]]
uniques[str_detect(uniques, "beer")]
## [1] "bottled beer" "canned beer"

As we can see, there are exactly two products describing beer, so we can be sure that no beer-related item will be omitted.


Association rules

General rules

First we need to create the rules in our dataset using the Apriori algorithm. The quality of each rule can be assessed with 3 measures: support, confidence and lift. Each measure is described in more detail in the following subchapters. The apriori function has default minimum values for support and confidence (0.1 and 0.8, respectively), which are too high for our dataset: with them no rules were found, so we should lower the thresholds of minimum support and confidence.

rules <- apriori(groceries, parameter = list(supp = 0.01, conf = 0.45)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.45    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.01s].
## writing ... [31 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

A total of 31 rules were found, which is a number we can take into further analysis, although the parameters had to be set relatively low compared to the default values.

Support

Support is a measure which tells us how often a certain set of items appears in the whole transaction set. In other words, it is the probability that a transaction contains that set. From the 31 rules obtained, the ones below are the top 6 according to support.

rules_supp <- sort(rules, by = "support", decreasing = TRUE)
rules_supp_dt <- inspect(head(rules_supp), linebreak = FALSE)
kable(rules_supp_dt, "html") %>% kable_styling("striped")
    lhs                                   rhs                 support    confidence  lift      count
[1] {domestic eggs}                    => {whole milk}        0.0299949  0.4727564   1.850203  295
[2] {butter}                           => {whole milk}        0.0275547  0.4972477   1.946053  271
[3] {curd}                             => {whole milk}        0.0261312  0.4904580   1.919480  257
[4] {other vegetables,root vegetables} => {whole milk}        0.0231825  0.4892704   1.914833  228
[5] {root vegetables,whole milk}       => {other vegetables}  0.0231825  0.4740125   2.449770  228
[6] {other vegetables,yogurt}          => {whole milk}        0.0222674  0.5128806   2.007235  219

We can see that the rule with the highest support (almost 3%) is the one where a person buys domestic eggs and whole milk. It means that 295 of the 9835 transactions contained both domestic eggs and whole milk. All the rules with high support values contain whole milk. All in all, the most common combinations of products in our dataset (given the parameters of the Apriori algorithm) contained dairy products or vegetables.
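
The relationship between the count and support columns can be verified with simple arithmetic. The sketch below only uses numbers reported above; the variable names are chosen for illustration.

```r
# Support = (transactions containing all items of the rule) / (all transactions)
n_transactions <- 9835          # from the dataset summary
rule_counts <- c(295, 271, 257) # counts of the three highest-support rules

support_from_counts <- rule_counts / n_transactions
print(round(support_from_counts, 7))
```

The results agree with the support column for {domestic eggs}, {butter} and {curd} paired with whole milk.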

Confidence

Confidence describes how likely it is that item B (rhs) is in a transaction, given that item A (lhs) is already in it. Its maximum value is 1, which is reached when customers always buy item B together with item A.

rules_conf <- sort(rules, by = "confidence", decreasing = TRUE)
rules_conf_dt <- inspect(head(rules_conf), linebreak = FALSE)
kable(rules_conf_dt, "html") %>% kable_styling("striped") 
    lhs                                 rhs                 support    confidence  lift      count
[1] {citrus fruit,root vegetables}   => {other vegetables}  0.0103711  0.5862069   3.029608  102
[2] {root vegetables,tropical fruit} => {other vegetables}  0.0123030  0.5845411   3.020999  121
[3] {curd,yogurt}                    => {whole milk}        0.0100661  0.5823529   2.279125  99
[4] {butter,other vegetables}        => {whole milk}        0.0114896  0.5736041   2.244885  113
[5] {root vegetables,tropical fruit} => {whole milk}        0.0119980  0.5700483   2.230969  118
[6] {root vegetables,yogurt}         => {whole milk}        0.0145399  0.5629921   2.203354  143

From the results we can gather that if a person bought citrus fruit and root vegetables, they will also buy other vegetables with a likelihood of roughly 59%. Overall, the confidence levels are not very high, which may be caused by the fact that there are many unique products and hence many combinations of them. Again, most of the top-confidence rules contain whole milk, which suggests that milk is bought regardless of what other products are bought.
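
Since confidence is the number of baskets containing both sides of a rule divided by the number of baskets containing the lhs, the lhs count can be backed out from the table. The sketch below does this for the top rule; the resulting lhs count is inferred from rounded table values and is not a figure reported by apriori itself.

```r
rule_count <- 102           # baskets with citrus fruit, root vegetables and other vegetables
rule_confidence <- 0.5862069

# confidence = count(lhs and rhs) / count(lhs), so count(lhs) = count / confidence
lhs_count <- round(rule_count / rule_confidence)
print(lhs_count)

# Recompute the confidence from the two counts as a consistency check
print(round(rule_count / lhs_count, 7))
```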

Lift

Lift can be seen as a measure of correlation of sorts. It tells us how much more likely items A and B are to be bought together than they would be if they were unrelated. A lift < 1 means that the products are bought together less often than independence would suggest, a lift > 1 means that they are bought together more often, and a lift = 1 means there is no difference.

rules_lift <- sort(rules, by = "lift", decreasing = TRUE)
rules_lift_dt <- inspect(head(rules_lift), linebreak = FALSE)
kable(rules_lift_dt, "html") %>% kable_styling("striped")
    lhs                                 rhs                 support    confidence  lift      count
[1] {citrus fruit,root vegetables}   => {other vegetables}  0.0103711  0.5862069   3.029608  102
[2] {root vegetables,tropical fruit} => {other vegetables}  0.0123030  0.5845411   3.020999  121
[3] {rolls/buns,root vegetables}     => {other vegetables}  0.0122013  0.5020921   2.594890  120
[4] {root vegetables,yogurt}         => {other vegetables}  0.0129131  0.5000000   2.584078  127
[5] {whipped/sour cream,yogurt}      => {other vegetables}  0.0101678  0.4901961   2.533410  100
[6] {root vegetables,whole milk}     => {other vegetables}  0.0231825  0.4740125   2.449770  228

We can see that the highest value of lift is a little above 3. It implies that other vegetables appeared about three times more often in transactions with citrus fruit and root vegetables than in transactions overall. All of the rules with the highest lifts have other vegetables as the rhs. What we can infer from this is that “other vegetables” are more likely to be bought with the products on the lhs than they would be if the purchases were independent. It seems that people don’t go shopping to buy vegetables only.
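
How the three measures fit together can be sketched from scratch in base R. The baskets and the helper support() below are hypothetical and exist only for this illustration; the point is that lift is confidence divided by the support of the rhs.

```r
# Hypothetical toy baskets (not taken from the dataset)
baskets <- list(
  c("root vegetables", "other vegetables"),
  c("root vegetables", "other vegetables", "whole milk"),
  c("whole milk"),
  c("other vegetables"),
  c("whole milk", "yogurt")
)

# Share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

supp_lhs  <- support("root vegetables")                         # 2 of 5 baskets
supp_rhs  <- support("other vegetables")                        # 3 of 5 baskets
supp_both <- support(c("root vegetables", "other vegetables"))  # 2 of 5 baskets

confidence <- supp_both / supp_lhs  # share of lhs baskets that also hold the rhs
lift <- confidence / supp_rhs       # same as supp_both / (supp_lhs * supp_rhs)
print(c(confidence = confidence, lift = lift))
```

In this toy set every basket with root vegetables also holds other vegetables, so the confidence is 1 and the lift is above 1.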

Beer rules

Now that we have a general overview of the rules in our set, we can try to find the ones of interest (bottled beer and canned beer). We have to lower both the support and the confidence thresholds for the algorithm, due to the generally lower number of transactions containing beer. The goal was to obtain approximately 5 rules for each type of packaging. Eventually, with the same confidence level (0.175) and different support levels, we obtained 6 rules for each.

Bottled beer

rules_bbeer <-
  apriori(
    data = groceries,
    parameter = list(supp = 0.002, conf = 0.175),
    appearance = list(default = "lhs", rhs = "bottled beer"),
    control = list(verbose = FALSE)
  )
rules_bbeer_dt <- inspect(rules_bbeer, linebreak = FALSE)
kable(rules_bbeer_dt, "html") %>% kable_styling("striped") 
    lhs                                            rhs             support    confidence  lift      count
[1] {liquor}                                    => {bottled beer}  0.0046772  0.4220183   5.240594  46
[2] {red/blush wine}                            => {bottled beer}  0.0048805  0.2539683   3.153760  48
[3] {bottled water,fruit/vegetable juice}       => {bottled beer}  0.0025419  0.1785714   2.217487  25
[4] {bottled water,soda}                        => {bottled beer}  0.0050839  0.1754386   2.178584  50
[5] {bottled water,whole milk}                  => {bottled beer}  0.0061007  0.1775148   2.204366  60
[6] {bottled water,other vegetables,whole milk} => {bottled beer}  0.0024403  0.2264151   2.811607  24

We can see that the rule with the highest confidence and a really high lift is the one that connects bottled beer to liquor. A lift of about 5 means that liquor and bottled beer are bought together roughly 5 times as often as they would be if they were bought independently. It is pretty self-explanatory, since most people don’t drink much alcohol on a daily basis and such purchases are probably made for parties or other group gatherings. The two biggest values of support (rules no. 4 and no. 5) show that beer is also frequently bought with other drinks, such as water, soda or milk. The only food appearing in those rules is vegetables, which is quite peculiar. One would expect someone to buy beer along with, for example, some meat for grilling.
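
The lift of rule no. 1 can also be read as a ratio of observed to expected baskets. The sketch below uses only the count and lift from the table; the expected count is derived from them and is not part of the apriori output.

```r
observed_count <- 46   # baskets with both liquor and bottled beer
rule_lift <- 5.240594

# lift = observed / expected under independence, so expected = observed / lift
expected_count <- observed_count / rule_lift
print(round(expected_count, 1))
```

In other words, if liquor and bottled beer were bought independently, we would expect fewer than 10 such baskets rather than 46.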

Below we can see a plot mapping the rules onto a 2D plane, where we can see more clearly that rule no. 1 stands out from the rest.

plotly_arules(rules_bbeer)
plot(rules_bbeer, method="grouped")

Canned beer

rules_cbeer <-
  apriori(
    data = groceries,
    parameter = list(supp = 0.001, conf = 0.175),
    appearance = list(default = "lhs", rhs = "canned beer"),
    control = list(verbose = FALSE)
  )
rules_cbeer_dt <- inspect(rules_cbeer, linebreak = FALSE)
kable(rules_cbeer_dt, "html") %>% kable_styling("striped")
    lhs                                   rhs            support    confidence  lift      count
[1] {liquor (appetizer)}               => {canned beer}  0.0017285  0.2179487   2.805662  17
[2] {chicken,soda}                     => {canned beer}  0.0015252  0.1829268   2.354824  15
[3] {coffee,soda}                      => {canned beer}  0.0019319  0.1938776   2.495793  19
[4] {sausage,shopping bags,soda}       => {canned beer}  0.0010168  0.1785714   2.298757  10
[5] {rolls/buns,sausage,shopping bags} => {canned beer}  0.0014235  0.2372881   3.054619  14
[6] {rolls/buns,shopping bags,soda}    => {canned beer}  0.0015252  0.2419355   3.114444  15

As we can see, canned beer appears to be generally less popular among customers in these combinations: the support levels of its rules are lower. Similarly to bottled beer, it is bought often enough with liquor to have its own rule, but that rule isn’t leading in any of the measures this time. The highest support goes to rule no. 3, which also contains coffee and soda, so perhaps canned beer is treated more like an “everyday drink” than bottled beer. This may also be connected to the fact that 4 out of 6 rules contain food, compared to 1 in the previous calculations. Not only is food more common in these rules, it is also more strongly associated with canned beer. Sausages, buns, sodas and canned beer form a pretty popular combination for homemade barbecues or bonfires.
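
The overall popularity of the two beers can be estimated from the tables themselves, because lift equals confidence divided by the support of the rhs. The sketch below takes the first rule from each table; the resulting item supports are derived from rounded values and should be read as approximations.

```r
# First rule from each beer table: confidence and lift as reported
bottled <- c(confidence = 0.4220183, lift = 5.240594)
canned  <- c(confidence = 0.2179487, lift = 2.805662)

# support(rhs) = confidence / lift
supp_bottled <- bottled[["confidence"]] / bottled[["lift"]]
supp_canned  <- canned[["confidence"]] / canned[["lift"]]
print(round(c(bottled = supp_bottled, canned = supp_canned), 4))
```

Both values come out at roughly 8% of all baskets, with canned beer only slightly behind, so the lower rule supports seem to reflect weaker combinations rather than a much less popular product.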

On the plot below we can see that the rules bundling sausage, buns and canned beer are the ones with the highest confidence and lift, with slightly smaller values, again, for liquor.

plotly_arules(rules_cbeer)
plot(rules_cbeer, method="grouped") 


Summary

Generally speaking, the association rules for bottled and canned beer proved to be plausible. Bottled beer was bought mostly with liquor or other drinks, while canned beer was associated more with food. This leads me to believe that there might be a distinction between them, i.e. bottled beer being a “party beer” and canned beer being more of a “barbecue/bonfire beer”.

Whatever the distinctions, shopkeepers could clearly benefit from such knowledge. They could bundle products according to the generated rules, or give a discount on product A when product B is already in the shopping cart. They could even do something as simple and low-cost as rearranging the shelves so that beer is near the liquor (it’s like this already, well done shopkeepers) or moving the barbecue equipment and necessities toward the beer shelves. Such subtle nudges would probably increase the number of products sold by catching the “lazy” customers who don’t want to go to another aisle for something they have only just remembered while looking at a particular shelf.