2.1 Main data
Association rule mining is one of the most popular data mining methods and is widely used in retail to improve marketing strategies. It lets sellers arrange a store so that products frequently bought together sit next to each other, which can increase sales.
The dataset comes from Kaggle. It contains grocery store purchases of products such as bread, water and beer.
library(arules)
library(tidyverse)
library(arulesViz)
library(knitr)
library(kableExtra)
We read the transactions in basket format. This gives us a transactions object that functions such as itemFrequencyPlot can work with.
groceries = read.transactions(file = "ItemList.csv", rm.duplicates = TRUE, format = "basket", sep = ",", cols = 1)
## distribution of transactions with duplicates:
## items
## 1 2 3 4
## 662 39 5 1
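To make the effect of rm.duplicates = TRUE more concrete, here is a minimal base-R sketch on two made-up baskets. The `raw` list and its item names are hypothetical and are not read from ItemList.csv; the code mimics the idea (count repeated items per basket, then keep each item once), not the internals of read.transactions.

```r
# Hypothetical raw baskets; "milk" appears twice in the first one
raw <- list(
  c("milk", "bread", "milk"),
  c("soda", "beer")
)

# Number of repeated items in each basket
n_dups <- sapply(raw, function(b) sum(duplicated(b)))

# Tabulate baskets by how many repeated items they contain,
# similar in spirit to the "distribution of transactions with duplicates"
dup_table <- table(n_dups[n_dups > 0])
print(dup_table)

# Keep every item at most once per basket
clean <- lapply(raw, unique)
print(clean)
```

Read this way, the output above would mean that 662 transactions contained one repeated item, 39 contained two, and so on.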
Below is a basic description of the dataset. After visualizing the main data, we will split and filter the required variables.
groceries
## transactions in sparse format with
## 14964 transactions (rows) and
## 168 items (columns)
summary(groceries)
## transactions as itemMatrix in sparse format with
## 14964 rows (elements/itemsets/transactions) and
## 168 columns (items) and a density of 0.01511843
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2363 1827 1646 1453
## yogurt (Other)
## 1285 29433
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10
## 206 10012 2727 1273 338 179 113 96 19 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 2.00 2.54 3.00 10.00
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
##
## includes extended transaction information - examples:
## transactionID
## 1
## 2 1
## 3 2
str(groceries)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:38007] 75 130 132 165 166 105 128 165 18 92 ...
## .. .. ..@ p : int [1:14965] 0 1 5 8 10 12 14 16 19 21 ...
## .. .. ..@ Dim : int [1:2] 168 14964
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 168 obs. of 1 variable:
## .. ..$ labels: chr [1:168] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "bags" ...
## ..@ itemsetInfo:'data.frame': 14964 obs. of 1 variable:
## .. ..$ transactionID: chr [1:14964] "" "1" "2" "3" ...
head(groceries)
## transactions in sparse format with
## 6 transactions (rows) and
## 168 items (columns)
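The figures reported by summary(groceries) are consistent with each other, which can be checked by hand. The sketch below uses only numbers copied from the output above; the variable names are chosen here for illustration.

```r
# Values copied from the summary(groceries) output
n_transactions <- 14964
n_items <- 168
sizes  <- 1:10
counts <- c(206, 10012, 2727, 1273, 338, 179, 113, 96, 19, 1)

# Total number of item occurrences, i.e. non-zero cells of the sparse matrix
total_entries <- sum(sizes * counts)

# Density: share of non-zero cells in the transactions-by-items matrix
density <- total_entries / (n_transactions * n_items)

# Mean basket size
mean_size <- total_entries / n_transactions

print(total_entries)
print(round(density, 8))
print(round(mean_size, 2))
```

The total matches the length of the `i` slot shown by str(groceries) (38007), and the density and mean basket size agree with the reported 0.01511843 and 2.54.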
The bar plot below shows the 30 most frequently purchased items; the dataset contains too many items to show them all. Whole milk is the most frequently purchased item, in contrast to white bread, which is bought far less often.
itemFrequencyPlot(groceries, topN = 30)
Using association rule mining, we aim to discover meaningful relationships between bakery products and other grocery items. Specifically, we seek to identify co-occurrence patterns, understand purchase frequency, and explore opportunities for better placement or promotion of white and brown bread.
In association rule mining, Apriori is one of the earliest and best-known algorithms for mining patterns in transaction data. Proposed in 1994 by Rakesh Agrawal and Ramakrishnan Srikant, it identifies frequent itemsets and generates association rules from them. The minimum support was lowered to 0.0002, with a minimum confidence of 0.9, because higher thresholds produced no rules.
rules <- apriori(groceries, parameter = list(supp = 0.0002, conf = 0.9))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 2e-04 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 2
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
## sorting and recoding items ... [165 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [25 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
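The level-wise idea behind Apriori ("checking subsets of size 1 2 3 ..." in the log above) can be sketched in base R on a toy example. The baskets, the `min_count` threshold and the helper `count_itemset` are all hypothetical; this is a simplified illustration of the pruning principle, not the arules implementation.

```r
# Hypothetical toy baskets (not taken from ItemList.csv)
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("beer", "bread"),
  c("butter", "milk"),
  c("bread", "butter", "milk")
)
min_count <- 3  # assumed absolute support threshold for this toy example

# Count how many baskets contain every item of `itemset`
count_itemset <- function(itemset) {
  sum(sapply(baskets, function(b) all(itemset %in% b)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(baskets)))
count1 <- sapply(items, count_itemset)
frequent1 <- names(count1)[count1 >= min_count]

# Level 2: candidates are built only from frequent single items,
# because any superset of an infrequent itemset is itself infrequent
candidates2 <- combn(frequent1, 2, simplify = FALSE)
count2 <- sapply(candidates2, count_itemset)
names(count2) <- sapply(candidates2, paste, collapse = ", ")
frequent2 <- count2[count2 >= min_count]

print(frequent1)
print(frequent2)
```

Here "beer" is dropped at level 1, so no pair containing it is ever counted. The same pruning is what keeps the search feasible at a support threshold as low as 0.0002.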
As we look into bread purchases, it helps to unpack what drives consumer choice. Focusing on a comparison of white bread and brown bread, we aim to identify preferences and associations in the dataset.
Support (s) measures how often an item or itemset appears in the dataset. It shows how regularly a specific combination of products occurs together in transactions. It is computed as the number of transactions that contain the item(s) divided by the total number of transactions.
shop_support = sort(rules, by = "support", decreasing = TRUE)
shop_support_df = inspect(head(shop_support), linebreak = FALSE)
shop_support_df %>%
kable() %>%
kable_styling()
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {house keeping products, other vegetables} | => | {whole milk} | 0.0002673 | 1 | 0.0002673 | 6.332628 | 4 |
| [2] | {house keeping products, margarine} | => | {whole milk} | 0.0002005 | 1 | 0.0002005 | 6.332628 | 3 |
| [3] | {flower (seeds), pork} | => | {whole milk} | 0.0002005 | 1 | 0.0002005 | 6.332628 | 3 |
| [4] | {canned vegetables, domestic eggs} | => | {whole milk} | 0.0002005 | 1 | 0.0002005 | 6.332628 | 3 |
| [5] | {butter, processed cheese} | => | {whole milk} | 0.0002005 | 1 | 0.0002005 | 6.332628 | 3 |
| [6] | {canned beer, hygiene articles, soda} | => | {whole milk} | 0.0002005 | 1 | 0.0002005 | 6.332628 | 3 |
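The support values in the table can be reproduced from the count column and the number of transactions. A minimal check, using only values shown above (variable names are chosen here for illustration):

```r
n_transactions <- 14964   # from the groceries summary
rule_counts <- c(4, 3)    # "count" column of rules [1] and [2] in the table

# Support = transactions containing all items of the rule / all transactions
support <- rule_counts / n_transactions
print(round(support, 7))
```

Even the best-supported rule therefore rests on only four baskets, so these rules should be read with caution.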
Confidence (c) measures the strength of an association rule. It estimates the probability of finding the consequent item(s) in a transaction given that the antecedent item(s) are present. It is computed as the number of transactions containing both the antecedent and the consequent divided by the number of transactions containing the antecedent.
shop_confidence = sort(rules, by = "confidence", decreasing = TRUE)
shop_confidence_df = inspect(head(shop_confidence), linebreak = FALSE)
shop_confidence_df %>%
kable() %>%
kable_styling()
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {house keeping products, margarine} | => | {whole milk} | 0.0002005 | 1 | 0.0002005 | 6.332628 | 3 |
| [2] | {house keeping products, other vegetables} | => | {whole milk} | 0.0002673 | 1 | 0.0002673 | 6.332628 | 4 |
| [3] | {flower (seeds), pork} | => | {whole milk} | 0.0002005 | 1 | 0.0002005 | 6.332628 | 3 |
| [4] | {canned vegetables, domestic eggs} | => | {whole milk} | 0.0002005 | 1 | 0.0002005 | 6.332628 | 3 |
| [5] | {butter, processed cheese} | => | {whole milk} | 0.0002005 | 1 | 0.0002005 | 6.332628 | 3 |
| [6] | {canned beer, hygiene articles, soda} | => | {whole milk} | 0.0002005 | 1 | 0.0002005 | 6.332628 | 3 |
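Since coverage is the support of the antecedent alone, confidence can be recovered as support divided by coverage. The sketch below does this for the rule {house keeping products, other vegetables} => {whole milk}, using the rounded values from the table (variable names are illustrative).

```r
n_transactions <- 14964
rule_support <- 0.0002673   # support of the whole rule (lhs and rhs together)
lhs_coverage <- 0.0002673   # coverage, i.e. support of the antecedent alone

# Confidence = support(lhs and rhs) / support(lhs)
confidence <- rule_support / lhs_coverage

# Approximate number of baskets containing the antecedent
lhs_count <- round(lhs_coverage * n_transactions)

print(confidence)
print(lhs_count)
```

A confidence of 1 here means that every one of the four baskets containing the antecedent also contained whole milk.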
Lift can be thought of as a kind of correlation measure. Put simply, it tells how much more often products X and Y are bought together than would be expected if they were purchased independently. A value greater than one suggests that the products tend to be bought together; a value less than one suggests that they tend to be bought separately.
shop_lift = sort(rules, by = "lift", decreasing = TRUE)
shop_lift_df = inspect(head(shop_lift), linebreak = FALSE)
shop_lift_df %>%
kable() %>%
kable_styling()
| | lhs | | rhs | support | confidence | coverage | lift | count |
|---|---|---|---|---|---|---|---|---|
| [1] | {chicken, citrus fruit, cream cheese} | => | {specialty chocolate} | 0.0002005 | 1 | 0.0002005 | 62.61088 | 3 |
| [2] | {frankfurter, root vegetables, soda} | => | {hamburger meat} | 0.0002005 | 1 | 0.0002005 | 45.76147 | 3 |
| [3] | {coffee, sausage, soda} | => | {frankfurter} | 0.0002005 | 1 | 0.0002005 | 26.48496 | 3 |
| [4] | {chicken, cream cheese, specialty chocolate} | => | {citrus fruit} | 0.0002005 | 1 | 0.0002005 | 18.82264 | 3 |
| [5] | {other vegetables, tropical fruit, whipped/sour cream} | => | {sausage} | 0.0002005 | 1 | 0.0002005 | 16.57143 | 3 |
| [6] | {pork, soda, whole milk, yogurt} | => | {sausage} | 0.0002005 | 1 | 0.0002005 | 16.57143 | 3 |
Looking at the top six rules, we can see that all of them have lift values greater than one. We can therefore conclude that the rhs products are more likely to be bought together with the lhs products than they would be if the purchases were independent. For the rule {chicken, citrus fruit, cream cheese} => {specialty chocolate}, the items were seen together about 62.61 times as often as would be expected under independence.
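Lift is the rule's confidence divided by the support of the consequent. As a rough check, the lift of the whole-milk rules in the earlier tables can be reproduced from the whole milk count in summary(groceries). The variable names are illustrative, and the last step is a back-calculation from the rounded lift, not a figure reported by arules.

```r
n_transactions <- 14964
whole_milk_count <- 2363   # from "most frequent items" in summary(groceries)
confidence <- 1            # confidence of the whole-milk rules in the tables

# Lift = confidence / support(rhs)
rhs_support <- whole_milk_count / n_transactions
lift <- confidence / rhs_support
print(round(lift, 6))

# Reading the relation backwards for the top rule: with confidence 1,
# lift = n / count(rhs), which implies a count for {specialty chocolate}
implied_rhs_count <- n_transactions / 62.61088
print(round(implied_rhs_count))
```

The very high lift of the top rule is thus driven largely by how rare the consequent is, combined with a rule count of only three.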
market_df <- as(rules, "data.frame")
ggplot(market_df, aes(x = support, y = confidence, size = lift)) +
geom_point(color = "blue") +
labs(title = "Support vs Confidence with Lift") +
theme(plot.title = element_text(hjust = 0.1))
Scatter plots are a convenient way to visualize the rules. Each plot uses two interest measures, one per axis: support on the x-axis and confidence on the y-axis.
plot(rules, jitter= 0)
plot(rules, measure = "confidence")
plot(rules, method = "two-key plot")
plot(rules, engine = "plotly")
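In association rule mining, a grouped matrix plot (Hahsler & Karpienko, 2017) gives a compact overview of many rules at once. Rules with similar antecedents are clustered into k groups (here 5), and each balloon summarizes the interest measures, such as support and lift, of one group for a given consequent.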
plot(rules, method = "grouped", control = list(k = 5))
plot(rules, method="paracoord")
plot(rules[1:20], method="graph")
plot(rules, method="graph")
Hahsler, M., & Karpienko, R. (2017). Visualizing association rules in hierarchical groups. Journal of Business Economics, 87(3), 317–335. https://doi.org/10.1007/s11573-016-0822-8
https://rpubs.com/eosowska/basket_analysis