1 Introduction

Association rule mining is one of the most popular data mining methods and is often used in retail to improve marketing strategies. It allows sellers to arrange a store so as to increase sales, for example by placing products that are frequently bought together next to each other.

2 Data Collection

The dataset comes from Kaggle. It consists of products sold in grocery stores, e.g. bread, water, and beer, among others.

library(arules)
library(tidyverse) 
library(arulesViz)
library(knitr)
library(kableExtra)

We have to read the transactions in basket format. This will allow us to use itemFrequencyPlot() later on.

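# cols = 1: the first column holds the transaction ID; rm.duplicates = TRUE drops repeated items within a transaction
groceries <- read.transactions(file = "ItemList.csv", rm.duplicates = TRUE, format = "basket", sep = ",", cols = 1)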

## distribution of transactions with duplicates:
## items
##   1   2   3   4 
## 662  39   5   1
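
To make the basket format concrete, here is a minimal base R sketch of what reading such a file amounts to. The three input lines are made up for illustration and are not taken from ItemList.csv; the first field plays the role of the transaction ID column selected by cols = 1. It is only an approximation of what read.transactions does, not its actual implementation.

```r
# Hypothetical basket-format lines: a transaction ID followed by the items.
basket_lines <- c(
  "1,bread,milk,bread",
  "2,beer",
  "3,milk,water,bread"
)

fields <- strsplit(basket_lines, ",", fixed = TRUE)
ids <- vapply(fields, function(f) f[1], character(1))

# Drop the ID field and remove repeated items within each basket
# (the effect that rm.duplicates = TRUE has on a transaction).
baskets <- lapply(fields, function(f) unique(f[-1]))
names(baskets) <- ids

# One row per transaction, one column per item, TRUE where the item occurs.
items <- sort(unique(unlist(baskets)))
incidence <- t(vapply(baskets, function(b) items %in% b, logical(length(items))))
colnames(incidence) <- items

print(incidence)
```

The resulting logical matrix is the dense counterpart of the sparse item matrix that a transactions object stores.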

Below is a basic description of the dataset. After inspecting the main data, we will split and filter the required variables.

2.1 Main data

groceries

## transactions in sparse format with
##  14964 transactions (rows) and
##  168 items (columns)

2.2 Summary

summary(groceries)

## transactions as itemMatrix in sparse format with
##  14964 rows (elements/itemsets/transactions) and
##  168 columns (items) and a density of 0.01511843 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2363             1827             1646             1453 
##           yogurt          (Other) 
##             1285            29433 
## 
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10 
##   206 10012  2727  1273   338   179   113    96    19     1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    2.00    2.54    3.00   10.00 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
## 
## includes extended transaction information - examples:
##   transactionID
## 1              
## 2             1
## 3             2
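
The density and the mean transaction size reported by summary() can be recomputed from the length distribution printed above. The sketch below uses only numbers copied from that output; the variable names are chosen here for illustration.

```r
# Transaction sizes and how many transactions have each size (from the summary).
sizes  <- 1:10
counts <- c(206, 10012, 2727, 1273, 338, 179, 113, 96, 19, 1)

n_transactions <- sum(counts)          # number of rows
n_items        <- 168                  # number of columns
n_entries      <- sum(sizes * counts)  # total number of item occurrences

# Density: share of cells in the transaction-by-item matrix that are filled.
density   <- n_entries / (n_transactions * n_items)
# Mean number of items per transaction.
mean_size <- n_entries / n_transactions

print(c(transactions = n_transactions, entries = n_entries))
print(round(c(density = density, mean_size = mean_size), 6))
```

A density of about 0.015 means that only around 1.5% of all possible transaction and item combinations actually occur, which is why the data is stored in sparse format.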

2.3 Structure

str(groceries)

## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:38007] 75 130 132 165 166 105 128 165 18 92 ...
##   .. .. ..@ p       : int [1:14965] 0 1 5 8 10 12 14 16 19 21 ...
##   .. .. ..@ Dim     : int [1:2] 168 14964
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  168 obs. of  1 variable:
##   .. ..$ labels: chr [1:168] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "bags" ...
##   ..@ itemsetInfo:'data.frame':  14964 obs. of  1 variable:
##   .. ..$ transactionID: chr [1:14964] "" "1" "2" "3" ...

2.4 First transactions

head(groceries)

## transactions in sparse format with
##  6 transactions (rows) and
##  168 items (columns)

3 Data Analysis

Because the dataset is quite large, the bar chart below shows only the 30 most frequently purchased items in those grocery shops. From the plot we can see that whole milk is the most frequently purchased item, while white bread sits at the opposite end of the chart.

itemFrequencyPlot(groceries, topN = 30)
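
The bar heights can be cross-checked against the item counts from summary(groceries). This sketch assumes the plot uses relative frequencies (the share of transactions containing each item); the counts are copied from the summary output and the variable names are illustrative.

```r
n_transactions <- 14964

# Counts of the five most frequent items, as reported by summary(groceries).
top_counts <- c(
  "whole milk"       = 2363,
  "other vegetables" = 1827,
  "rolls/buns"       = 1646,
  "soda"             = 1453,
  "yogurt"           = 1285
)

# Relative item frequency: share of transactions that contain the item.
relative_frequency <- top_counts / n_transactions
print(round(relative_frequency, 4))
```

Whole milk, for example, appears in roughly 16% of all transactions.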

4 Association Rule Mining

Using association rule mining methods, we aim to discover meaningful relationships between bakery products and other grocery items. Specifically, we seek to identify co-occurrence patterns, understand purchase frequency, and explore potential opportunities for successful placement or promotion of white and brown bread.

4.1 Algorithm Selection

In association rule mining, the Apriori algorithm is one of the earliest methods designed to mine valuable patterns in data sets. Developed in 1994 by Rakesh Agrawal and Ramakrishnan Srikant, Apriori is a basic algorithm for identifying frequent itemsets and generating association rules from them. The thresholds were set to 0.0002 (support) and 0.9 (confidence), since the initial settings produced no rules.

rules <- apriori(groceries, parameter = list(supp = 0.0002, conf = 0.9))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5   2e-04      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 2 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
## sorting and recoding items ... [165 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [25 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
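
The level-wise search behind Apriori can be sketched in a few lines of base R. This is a simplified illustration on made-up transactions, not the implementation used by the arules package; the function names count_of and apriori_sketch are hypothetical. It finds frequent itemsets only and leaves out rule generation.

```r
# Toy transactions (hypothetical, not taken from ItemList.csv).
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("beer", "bread"),
  c("butter", "milk"),
  c("bread", "butter", "milk")
)

# Number of transactions that contain every item of `itemset`.
count_of <- function(itemset, transactions) {
  sum(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

apriori_sketch <- function(transactions, min_count) {
  is_frequent <- function(s) count_of(s, transactions) >= min_count
  # Level 1: frequent single items.
  level <- Filter(is_frequent, as.list(sort(unique(unlist(transactions)))))
  frequent <- level
  while (length(level) > 1) {
    k <- length(level[[1]])
    keys <- vapply(level, paste, character(1), collapse = "|")
    # Join step: unions of two frequent k-itemsets that have k + 1 items.
    candidates <- list()
    for (i in seq_len(length(level) - 1)) {
      for (j in (i + 1):length(level)) {
        u <- sort(union(level[[i]], level[[j]]))
        if (length(u) == k + 1) candidates[[paste(u, collapse = "|")]] <- u
      }
    }
    # Prune step: every k-subset of a candidate must itself be frequent.
    pruned <- Filter(function(cand) {
      subsets <- combn(cand, k, simplify = FALSE)
      all(vapply(subsets, function(s) paste(s, collapse = "|") %in% keys,
                 logical(1)))
    }, candidates)
    # Count step: keep the candidates that reach the minimum count.
    level <- unname(Filter(is_frequent, pruned))
    frequent <- c(frequent, level)
  }
  frequent
}

frequent <- apriori_sketch(transactions, min_count = 3)
print(vapply(frequent, paste, character(1), collapse = ", "))
```

The prune step is what makes the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before any counting takes place.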

4.2 Patterns in Purchases

As we delve into bread purchasing patterns, it is important to unpack the determinants of consumer choice. Focusing on a specific comparison of white bread and brown bread, we aim to identify preferences and associations in the data set.

4.2.1 Support

Support (s) measures how often an item or itemset appears in the dataset. It shows how frequently a specific combination of products appears together in transactions. It is computed as the number of transactions that contain the item(s) divided by the total number of transactions.

shop_support = sort(rules, by = "support", decreasing = TRUE)
shop_support_df = inspect(head(shop_support), linebreak = FALSE)

shop_support_df %>%
  kable() %>%
  kable_styling()

	lhs		rhs	support	confidence	coverage	lift	count
[1]	{house keeping products, other vegetables}	=>	{whole milk}	0.0002673	1	0.0002673	6.332628	4
[2]	{house keeping products, margarine}	=>	{whole milk}	0.0002005	1	0.0002005	6.332628	3
[3]	{flower (seeds), pork}	=>	{whole milk}	0.0002005	1	0.0002005	6.332628	3
[4]	{canned vegetables, domestic eggs}	=>	{whole milk}	0.0002005	1	0.0002005	6.332628	3
[5]	{butter, processed cheese}	=>	{whole milk}	0.0002005	1	0.0002005	6.332628	3
[6]	{canned beer, hygiene articles, soda}	=>	{whole milk}	0.0002005	1	0.0002005	6.332628	3
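
As a worked example, the support values in the table can be recomputed from the count column and the total of 14964 transactions. The toy transactions in the second half are made up for illustration, and the function name support_of is hypothetical.

```r
n_transactions <- 14964

# Support = count / total number of transactions.
support_4 <- 4 / n_transactions   # rule [1], count 4
support_3 <- 3 / n_transactions   # rules [2] to [6], count 3
print(signif(c(support_4, support_3), 4))

# Smallest count that reaches the minimum support of 0.0002.
min_count <- ceiling(0.0002 * n_transactions)
print(min_count)

# The same definition applied to a toy set of transactions.
toy <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("beer", "bread"),
  c("butter", "milk"),
  c("bread", "butter", "milk")
)
support_of <- function(itemset, transactions) {
  mean(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}
print(support_of(c("bread", "milk"), toy))
```

With a threshold this low, an itemset needs to occur in only three transactions to count as frequent, so the rules found are based on very few observations.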

4.2.2 Confidence

Confidence (c) is a metric that measures the strength of an association rule between items. It estimates the probability of finding the consequent item(s) in a transaction given the presence of the antecedent item(s). Confidence is computed as the number of transactions containing both the antecedent and the consequent items divided by the number of transactions containing the antecedent item(s).

shop_confidence = sort(rules, by = "confidence", decreasing = TRUE)
shop_confidence_df = inspect(head(shop_confidence), linebreak = FALSE)

shop_confidence_df %>%
  kable() %>%
  kable_styling()

	lhs		rhs	support	confidence	coverage	lift	count
[1]	{house keeping products, margarine}	=>	{whole milk}	0.0002005	1	0.0002005	6.332628	3
[2]	{house keeping products, other vegetables}	=>	{whole milk}	0.0002673	1	0.0002673	6.332628	4
[3]	{flower (seeds), pork}	=>	{whole milk}	0.0002005	1	0.0002005	6.332628	3
[4]	{canned vegetables, domestic eggs}	=>	{whole milk}	0.0002005	1	0.0002005	6.332628	3
[5]	{butter, processed cheese}	=>	{whole milk}	0.0002005	1	0.0002005	6.332628	3
[6]	{canned beer, hygiene articles, soda}	=>	{whole milk}	0.0002005	1	0.0002005	6.332628	3
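
The definition can be illustrated on a small made-up example. The transactions and the helper names count_of and confidence_of below are hypothetical and only serve to show the calculation.

```r
# Toy transactions (hypothetical, not taken from ItemList.csv).
toy <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("beer", "bread"),
  c("butter", "milk"),
  c("bread", "butter", "milk")
)

# Number of transactions that contain every item of `itemset`.
count_of <- function(itemset, transactions) {
  sum(vapply(transactions, function(tr) all(itemset %in% tr), logical(1)))
}

# Confidence of lhs => rhs: transactions with lhs and rhs / transactions with lhs.
confidence_of <- function(lhs, rhs, transactions) {
  count_of(union(lhs, rhs), transactions) / count_of(lhs, transactions)
}

conf_butter_milk <- confidence_of("butter", "milk", toy)  # every butter basket has milk
conf_bread_milk  <- confidence_of("bread", "milk", toy)   # 3 of 4 bread baskets have milk
print(c(butter_milk = conf_butter_milk, bread_milk = conf_bread_milk))
```

A confidence of 1, as in all six rules listed above, means that every transaction containing the antecedent also contains the consequent. In that case the coverage (the support of the antecedent) equals the support of the rule, which is what the table shows.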

4.2.3 Lift

Lift can be thought of as a kind of correlation measure. Put simply, it indicates how much more likely products X and Y are to be bought together than would be expected if they were bought independently of each other. A value greater than one indicates that the products tend to be bought together; a value less than one indicates that they tend to be bought separately.

shop_lift = sort(rules, by = "lift", decreasing = TRUE)
shop_lift_df = inspect(head(shop_lift), linebreak = FALSE)

shop_lift_df %>%
  kable() %>%
  kable_styling()

	lhs		rhs	support	confidence	coverage	lift	count
[1]	{chicken, citrus fruit, cream cheese}	=>	{specialty chocolate}	0.0002005	1	0.0002005	62.61088	3
[2]	{frankfurter, root vegetables, soda}	=>	{hamburger meat}	0.0002005	1	0.0002005	45.76147	3
[3]	{coffee, sausage, soda}	=>	{frankfurter}	0.0002005	1	0.0002005	26.48496	3
[4]	{chicken, cream cheese, specialty chocolate}	=>	{citrus fruit}	0.0002005	1	0.0002005	18.82264	3
[5]	{other vegetables, tropical fruit, whipped/sour cream}	=>	{sausage}	0.0002005	1	0.0002005	16.57143	3
[6]	{pork, soda, whole milk, yogurt}	=>	{sausage}	0.0002005	1	0.0002005	16.57143	3

Analyzing the values of the top six rules, we can see that all of them have lift values higher than one. We can therefore conclude that the rhs products are more likely to be bought together with the lhs products than they would be if the two were independent. For the rule {chicken, citrus fruit, cream cheese} => {specialty chocolate}, the items were seen together in transactions at 62.61 times the rate expected under independence.
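
Lift can be written as the confidence of a rule divided by the support of its right-hand side, which allows a quick check against the numbers reported earlier. The whole milk count comes from summary(groceries); the implied count for specialty chocolate is only derived here from the rounded lift value and is not reported anywhere in the output.

```r
n_transactions <- 14964

# Rules with {whole milk} on the right-hand side: confidence 1, 2363 milk baskets.
whole_milk_support <- 2363 / n_transactions
lift_whole_milk    <- 1 / whole_milk_support
print(round(lift_whole_milk, 6))

# Working backwards for {specialty chocolate}: with confidence 1,
# support(rhs) = 1 / lift, so the implied number of transactions is n / lift.
implied_chocolate_count <- n_transactions / 62.61088
print(round(implied_chocolate_count))
```

The first value reproduces the lift of 6.332628 shown for the whole milk rules. The second suggests that specialty chocolate is a rare item, which is why a rule that always leads to it receives such a high lift even though it rests on only three transactions.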

4.3 Support vs Confidence with Lift

market_df <- as(rules, "data.frame")
ggplot(market_df, aes(x = support, y = confidence, size = lift)) +
  geom_point(color = "blue") +
  labs(title = "Support vs Confidence with Lift") +
  theme(plot.title = element_text(hjust = 0.1))

5 Visualization

5.1 Scatter Plots

We can use scatter plots to visualize the rules. Each plot uses two interest measures, one on each axis: “support” on the X axis and “confidence” on the Y axis.

5.1.1 Scatter Plot

plot(rules, jitter= 0)

5.1.2 Scatter plot with confidence

plot(rules, measure = "confidence")

5.1.3 Two-key plot

plot(rules, method = "two-key plot")

5.2 Interactive Scatter plots

plot(rules, engine = "plotly")

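5.3 Grouped Matrix Plot

In the context of association rule mining, a grouped matrix plot is used to visualize a set of rules compactly. Rules with similar antecedents are clustered into groups (here k = 5), and each group is drawn as a balloon whose size and color encode interest measures such as support and lift.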

plot(rules, method = "grouped", control = list(k = 5))

5.4 Parallel coordinate plot

plot(rules, method="paracoord")

5.5 Graphs

5.5.1 First 20 rules

plot(rules[1:20], method="graph")

5.5.2 All rules

plot(rules, method="graph")

6 References

Hahsler, M., & Karpienko, R. (2017). Visualizing association rules in hierarchical groups. Journal of Business Economics, 87(3), 317–335. https://doi.org/10.1007/s11573-016-0822-8

https://rpubs.com/eosowska/basket_analysis

Market Basket Analysis by Xaviar

Introduction to Association Rule Mining in R

Association Rule Mining

Daisy Mutua

2024-02-13