Association rules with market basket analysis

This paper uses a Kaggle dataset (link) containing 36765 transactions to conduct a hypothetical study of grocery store strategy. We search for bundles of two or more products that could be put on sale together. This common practice, called cross-selling, is used by stores to boost sales of less popular products. As we do not have exact sales data for the products in the dataset, we will choose a few products from the less prevalent segment. To this end, we conduct a market basket analysis using the Apriori algorithm.

Loading necessary packages

library(tidyverse)
library(arules)
library(arulesViz)
library(knitr)

Data pre-processing

data <- read.csv('Groceries_dataset.csv')
head(data)

##   Member_number       Date  itemDescription
## 1          1808 21-07-2015   tropical fruit
## 2          2552 05-01-2015       whole milk
## 3          2300 19-09-2015        pip fruit
## 4          1187 12-12-2015 other vegetables
## 5          3037 01-02-2015       whole milk
## 6          4941 14-02-2015       rolls/buns

As we can see, the data needs further transformation. The dataset has 3 columns: the customer ID, the date of the transaction, and a single product. The first step is to reshape the data so that each row holds a vector with the customer's whole market basket for a given day. The tidyverse is useful for this purpose. Along the way, unwanted or uninformative items are removed from the dataset (e.g. ‘other vegetables’, which tells us nothing specific; the code also drops ‘shopping bags’ and ‘newspapers’). Additionally, dates and customer IDs are removed and replaced by integers numbering the transactions. Finally, we make sure that no empty vectors are present in the column.

data <- subset(data, !grepl("other vegetables", itemDescription, ignore.case = TRUE))
data <- subset(data, !grepl("shopping bags", itemDescription, ignore.case = TRUE))
data <- subset(data, !grepl("newspapers", itemDescription, ignore.case = TRUE))
data_proc <- data %>%
  group_by(Member_number, Date) %>%
  summarise(items = list(itemDescription))

## `summarise()` has grouped output by 'Member_number'. You can override using the
## `.groups` argument.

data_proc$transaction_id <- seq_along(data_proc$Date)
data_proc <- data_proc[, !(names(data_proc) %in% c("Date", "Member_number"))]
data_proc <- data_proc[sapply(data_proc$items, function(vec) length(vec) > 0), ]
head(data_proc)

## # A tibble: 6 × 2
##   items     transaction_id
##   <list>             <int>
## 1 <chr [4]>              1
## 2 <chr [3]>              2
## 3 <chr [2]>              3
## 4 <chr [2]>              4
## 5 <chr [2]>              5
## 6 <chr [2]>              6

With the data processed, we can now save it as a new transactions variable.

trans <- as(data_proc$items, "transactions")

## Warning in asMethod(object): removing duplicated items in transactions

summary(trans)

## transactions as itemMatrix in sparse format with
##  14893 rows (elements/itemsets/transactions) and
##  164 columns (items) and a density of 0.01428278 
## 
## most frequent items:
##      whole milk      rolls/buns            soda          yogurt root vegetables 
##            2363            1646            1453            1285            1041 
##         (Other) 
##           27097 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9 
## 1839 8973 2415 1056  283  157   94   68    8 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.342   3.000   9.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

itemFrequencyPlot(trans,topN=20,type="absolute")

From the summary of the transactions object, we can see that the most frequently bought item is ‘whole milk’, which seems logical. The less popular items, however, are of more interest to us.

Constructing the rules

To construct the rules, we use the Apriori algorithm. Note that low thresholds for support and confidence have to be used, as the frequent occurrence of ‘whole milk’ seems to distort the dataset. Additionally, the minlen parameter sets the minimum rule length to 2, so that every rule has at least one item on each side.

rules <- apriori(trans, parameter = list(support = 0.001, confidence = 0.1, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 14 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[164 item(s), 14893 transaction(s)] done [0.00s].
## sorting and recoding items ... [146 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [95 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

temp <- kable(inspect(head(rules)))

temp

	lhs		rhs	support	confidence	coverage	lift	count
[1]	{frozen fish}	=>	{whole milk}	0.0010743	0.1568627	0.0068489	0.9886402	16
[2]	{seasonal products}	=>	{rolls/buns}	0.0010072	0.1415094	0.0071174	1.2803767	15
[3]	{pot plants}	=>	{whole milk}	0.0010072	0.1282051	0.0078560	0.8080233	15
[4]	{pasta}	=>	{whole milk}	0.0010743	0.1322314	0.0081246	0.8333992	16
[5]	{pickled vegetables}	=>	{whole milk}	0.0010072	0.1119403	0.0089975	0.7055129	15
[6]	{packaged fruit/vegetables}	=>	{rolls/buns}	0.0012086	0.1417323	0.0085275	1.2823930	18
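The relative support threshold can be translated into a number of baskets by hand. The short sketch below uses only two numbers from the apriori() call and output above (support = 0.001 and 14893 transactions); the variable names are illustrative and not part of the original analysis.

```r
# Numbers taken from the apriori() call and its output above
n_transactions <- 14893
min_support    <- 0.001

# Relative support expressed as a number of baskets
threshold <- n_transactions * min_support   # about 14.893

# Smallest whole number of baskets whose relative support reaches 0.001
min_count <- ceiling(threshold)             # 15

# Relative support of an itemset seen in exactly min_count baskets
min_count / n_transactions
```

The output reports an absolute minimum support count of 14, while every rule shown in the table has a count of 15 or more, which agrees with the ceiling computed here.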

We get 95 rules, and most of them contain ‘whole milk’. Whole milk appears to be bought regardless of what else is in the basket, so it is not of interest to us. The data therefore needs some more cleaning. All rows with ‘whole milk’ will be removed from the original dataset. Additionally, all vectors with a length below 2 will be removed before the conversion to transactions. The logic is as follows: we want to see which products can be bundled together, so we are not interested in products that are bought anyway, nor in products bought alone in a transaction.

data <- subset(data, !grepl("whole milk", itemDescription, ignore.case = TRUE))
data_proc <- data %>%
  group_by(Member_number, Date) %>%
  summarise(items = list(itemDescription))

## `summarise()` has grouped output by 'Member_number'. You can override using the
## `.groups` argument.

data_proc$transaction_id <- seq_along(data_proc$Date)
data_proc <- data_proc[, !(names(data_proc) %in% c("Date", "Member_number"))]
data_proc <- data_proc[sapply(data_proc$items, function(vec) length(vec) >= 2), ]
trans <- as(data_proc$items, "transactions")

## Warning in asMethod(object): removing duplicated items in transactions

summary(trans)

## transactions as itemMatrix in sparse format with
##  11973 rows (elements/itemsets/transactions) and
##  163 columns (items) and a density of 0.01525775 
## 
## most frequent items:
##      rolls/buns            soda          yogurt root vegetables  tropical fruit 
##            1498            1322            1177             953             926 
##         (Other) 
##           23901 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9 
##  151 8191 2247  857  263  137   82   41    4 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.487   3.000   9.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

itemFrequencyPlot(trans,topN=20,type="absolute")
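The summary above still lists 151 transactions of size 1, even though baskets shorter than 2 were filtered out. A likely explanation is that the filter counts repeated purchases of the same item, while the conversion to transactions drops duplicates (see the warning). The sketch below illustrates the effect on a hypothetical toy list and shows one way to filter on distinct items instead; `baskets` and its contents are made up for illustration.

```r
# Hypothetical toy baskets; the second one holds the same item twice
baskets <- list(
  c("soda", "sausage"),
  c("yogurt", "yogurt"),
  c("rolls/buns", "pasta", "pasta")
)

# Filtering on raw length would keep all three baskets
raw_len <- sapply(baskets, length)
raw_len                     # 2 2 3

# Counting distinct items reveals that the second basket has only one
distinct_len <- sapply(baskets, function(vec) length(unique(vec)))
distinct_len                # 2 1 2

# Deduplicate first, then keep baskets with at least two distinct items
baskets_clean <- lapply(baskets, unique)
baskets_clean <- baskets_clean[lengths(baskets_clean) >= 2]
length(baskets_clean)       # 2
```

Applying the same idea to the real data would change the transaction counts reported in this paper, so it is shown here only as an illustration.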

Now with ‘whole milk’ and vectors shorter than 2 removed, we can run the Apriori algorithm once again.

rules <- apriori(trans, parameter = list(support = 0.001, confidence = 0.1, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 11 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[163 item(s), 11973 transaction(s)] done [0.00s].
## sorting and recoding items ... [146 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [37 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

temp <- kable(inspect(head(rules)))

temp

	lhs		rhs	support	confidence	coverage	lift	count
[1]	{seasonal products}	=>	{rolls/buns}	0.0012528	0.1515152	0.0082686	1.2110086	15
[2]	{canned fish}	=>	{rolls/buns}	0.0011693	0.1320755	0.0088533	1.0556339	14
[3]	{pot plants}	=>	{yogurt}	0.0010858	0.1203704	0.0090203	1.2244643	13
[4]	{pasta}	=>	{yogurt}	0.0011693	0.1250000	0.0093544	1.2715590	14
[5]	{pasta}	=>	{soda}	0.0010023	0.1071429	0.0093544	0.9703642	12
[6]	{packaged fruit/vegetables}	=>	{rolls/buns}	0.0015034	0.1512605	0.0099390	1.2089733	18

We can see that the rules are now much more diverse, giving us insight into more nuanced customer preferences. We can now view the rules sorted by their support, confidence and lift. Coverage will not be discussed below, as it is simply the support of the LHS, i.e. the share of all transactions that contain the LHS item.

Inspecting the rules

Support & Confidence

The first measure we inspect is support. It is defined as the number of transactions containing the items on both sides of the rule divided by the total number of transactions.

Confidence, on the other hand, measures the strength of the association between two items. It is calculated as the joint support of the two items divided by the support of the left-hand-side item.
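The two definitions can be checked by hand on a small example. The sketch below uses a hypothetical list of five baskets (not taken from the dataset) and a helper function `support()` written only for this illustration; it is not a function from arules.

```r
# Hypothetical toy baskets (not taken from the dataset)
baskets <- list(
  c("sausage", "soda"),
  c("sausage", "yogurt"),
  c("soda", "rolls/buns"),
  c("sausage", "soda", "rolls/buns"),
  c("yogurt", "rolls/buns")
)

# Illustrative helper: share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Rule {sausage} => {soda}
supp_rule <- support(c("sausage", "soda"), baskets)  # 2 of 5 baskets = 0.4
supp_lhs  <- support("sausage", baskets)             # 3 of 5 baskets = 0.6

# Confidence: joint support divided by the support of the LHS
conf_rule <- supp_rule / supp_lhs                    # 0.4 / 0.6, about 0.667
```

Here `supp_lhs` plays the role of the coverage column in the rule tables, since coverage is the support of the LHS.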

temp <- kable(inspect(head(sort(rules, by = "support", decreasing = TRUE))))
temp1 <- kable(inspect(head(sort(rules, by = "confidence", decreasing = TRUE))))

temp

	lhs		rhs	support	confidence	coverage	lift	count
[1]	{sausage}	=>	{soda}	0.0074334	0.1082725	0.0686545	0.9805951	89
[2]	{sausage}	=>	{yogurt}	0.0071828	0.1046229	0.0686545	1.0642733	86
[3]	{pip fruit}	=>	{rolls/buns}	0.0061806	0.1106129	0.0558757	0.8840906	74
[4]	{fruit/vegetable juice}	=>	{rolls/buns}	0.0046772	0.1174004	0.0398396	0.9383413	56
[5]	{frankfurter}	=>	{rolls/buns}	0.0045937	0.1086957	0.0422618	0.8687671	55
[6]	{pork}	=>	{rolls/buns}	0.0042596	0.1013917	0.0420112	0.8103887	51

temp1

	lhs		rhs	support	confidence	coverage	lift	count
[1]	{processed cheese}	=>	{rolls/buns}	0.0018375	0.1538462	0.0119435	1.229639	22
[2]	{seasonal products}	=>	{rolls/buns}	0.0012528	0.1515152	0.0082686	1.211009	15
[3]	{packaged fruit/vegetables}	=>	{rolls/buns}	0.0015034	0.1512605	0.0099390	1.208973	18
[4]	{red/blush wine}	=>	{rolls/buns}	0.0016704	0.1388889	0.0120271	1.110091	20
[5]	{detergent}	=>	{yogurt}	0.0013363	0.1344538	0.0099390	1.367727	16
[6]	{soft cheese}	=>	{yogurt}	0.0015869	0.1338028	0.0118600	1.361106	19

We can see that the rule with the highest support is sausage -> soda. One possible explanation is the grilling season: sausages and sodas seem to be the perfect combination for a nice barbecue (author's assumption). Although it is the most frequent rule in this dataset, its lift of about 0.98 indicates no positive association between the purchases of those items. That being said, support is low for every rule in the dataset.

The rules with the highest confidence are not as informative. A similar situation occurs, where an everyday product (in this case ‘rolls/buns’) dominates the rules. Note how the rules with rolls/buns have relatively high confidence and a lift above 1. This might indicate that rolls/buns are rarely bought by themselves and are usually picked up along the way. The same can be said for the items on the left-hand side of the rules, which are most probably everyday household groceries.

Lift

Lift tells us how much more likely item B is to be bought when item A is in the basket, compared with how often B is bought overall. It is calculated as the confidence of the rule divided by the support of item B; a value above 1 indicates a positive association. For the assumed purpose of this paper, this is the most significant measure, as it can tell us how to boost sales of less popular items by bundling them with other items.

temp <- kable(inspect(head(sort(rules, by = "lift", decreasing = TRUE))))

temp

	lhs		rhs	support	confidence	coverage	lift	count
[1]	{beverages}	=>	{sausage}	0.0019210	0.1017699	0.0188758	1.482349	23
[2]	{flour}	=>	{tropical fruit}	0.0013363	0.1142857	0.0116930	1.477692	16
[3]	{processed cheese}	=>	{root vegetables}	0.0013363	0.1118881	0.0119435	1.405704	16
[4]	{detergent}	=>	{yogurt}	0.0013363	0.1344538	0.0099390	1.367727	16
[5]	{soft cheese}	=>	{yogurt}	0.0015869	0.1338028	0.0118600	1.361106	19
[6]	{pasta}	=>	{yogurt}	0.0011693	0.1250000	0.0093544	1.271559	14

As we can see, sausage is more likely to be bought with beverages (which is very interesting when combined with the results of the Support section; we will touch on that in the Conclusions section). Tropical fruit is more likely to be bought with flour, which might indicate a nice bundle for bakers experimenting with new cakes containing tropical fruit. Root vegetables are more likely to be bought with processed cheese, which might reflect the community's breakfast and dinner needs, but those ingredients may also be used in a plethora of nice meals, one of them being a potato casserole (this is just the author's speculation).
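The lift of the top rule can be reproduced from numbers already printed in this paper. The sketch below copies the confidence of {beverages} => {sausage} from the lift table and takes the support of sausage from the coverage column of the {sausage} rules in the support table; the variable names are illustrative.

```r
# Values copied from the tables in this paper
conf_bev_sausage <- 0.1017699   # confidence of {beverages} => {sausage}
supp_sausage     <- 0.0686545   # coverage of the {sausage} rules = support of sausage

# Lift: confidence of the rule divided by the support of the RHS
lift_bev_sausage <- conf_bev_sausage / supp_sausage
round(lift_bev_sausage, 3)      # 1.482
```

The result agrees with the lift of 1.482349 reported in the table, up to the rounding of the inputs.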

Visualisation

plot(rules,
     method = "graph",
     limit = 20)

Above is a visualization of the rules, along with the lift and support of particular items or item pairs. It is not very readable; a more effective way would be to use the ruleExplorer() function from arulesViz, which allows zooming into the graph (sadly, it is not compatible with HTML markdown output). The graph reinforces the association of sausages and beverages, and it also uncovers the unlikely but logical association of oil and soda. The story behind this pair might be the dinner preferences of the populace: frying the dinner and having a soda to go with it.

plot(rules,
     method = "scatterplot",
     limit = 20)

The scatter plot of support, confidence and lift reveals that no rule has high support. This might indicate that the product bundles with high confidence are made up of products that are rarely bought in general. Therefore our focus should shift to lift, which will be the basis for our conclusions.

plot(rules,
     method = "matrix",
     limit = 20)

## Itemsets in Antecedent (LHS)
##  [1] "{flour}"                     "{soft cheese}"              
##  [3] "{processed cheese}"          "{beverages}"                
##  [5] "{pasta}"                     "{chewing gum}"              
##  [7] "{pot plants}"                "{seasonal products}"        
##  [9] "{packaged fruit/vegetables}" "{detergent}"                
## [11] "{oil}"                       "{herbs}"                    
## [13] "{red/blush wine}"            "{sausage}"                  
## [15] "{canned fish}"               "{ice cream}"                
## [17] "{chocolate}"                
## Itemsets in Consequent (RHS)
## [1] "{rolls/buns}"      "{soda}"            "{yogurt}"         
## [4] "{root vegetables}" "{tropical fruit}"  "{sausage}"

The last graph visualizes a matrix showing the lift of the rules. Lower indices of the LHS tend to have a higher lift with the RHS than higher indices. That being said, while the higher indices of the RHS have a higher lift, its lower indices more frequently have a lift above 1.

Finding optimal sales bundles with market basket analysis

Jan Szczepanek

2024-02-22