This report explores association rules using the groceries dataset. The aim is to identify relationships between purchased items and uncover patterns using unsupervised learning techniques.
Several studies have applied association rules to grocery-related datasets. One of them is “Association Rules for Purchase Dependency of Grocery Items” by S.S. Radiah Shariff. The study identifies purchase dependence among grocery products in retail stores and supermarkets, with the goal of reducing lost sales by building purchase dependence into the inventory model. Its association analysis uses primary data from 130 consumer sales transactions in retail stores and a supermarket, and examines the relationships between product categories. The inventory model is then simulated in Excel spreadsheets using the results of the association analysis. According to the simulation results, the extended inventory models had lower average overall inventory costs than the inventory model that disregarded purchase dependencies.
Park and Seo (2013) examined purchase dependencies by drawing on the practice of purchasing spare parts in the inventory operations of the Hyundai Engine Europe Service Center (HEESC). Heitz-Spahn (2013) examined cross-channel free-riding in consumers’ purchasing behavior, with the aim of learning more about free-riding in a multichannel retailing setting. In an online survey, 741 French respondents answered questions about their choices of channels and shops, and about whether they were retention or free-riding consumers.
The dataset used in this project is the Groceries dataset.
Link to the dataset -> https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset
library(arules)
## Warning: package 'arules' was built under R version 4.4.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 4.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
groceries <- read.csv("C://Users/Admin/Desktop/Groceries_dataset.csv")
head(groceries)
## Member_number Date itemDescription
## 1 1808 21-07-2015 tropical fruit
## 2 2552 05-01-2015 whole milk
## 3 2300 19-09-2015 pip fruit
## 4 1187 12-12-2015 other vegetables
## 5 3037 01-02-2015 whole milk
## 6 4941 14-02-2015 rolls/buns
The head() function shows the first six rows of the dataset. Each row records one purchased item together with the member number and the purchase date.
summary(groceries)
## Member_number Date itemDescription
## Min. :1000 Length:38765 Length:38765
## 1st Qu.:2002 Class :character Class :character
## Median :3005 Mode :character Mode :character
## Mean :3004
## 3rd Qu.:4007
## Max. :5000
The summary() function gives an overview of the dataset: it has 38,765 rows, member numbers range from 1000 to 5000, and both Date and itemDescription are stored as character columns.
Data preprocessing is an essential step in any unsupervised learning project: the data must be well structured before any analysis.
Convert transactions to a suitable format:
groceries_trans <- split(groceries$itemDescription, groceries$Member_number)
transactions <- as(groceries_trans, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
summary(transactions)
## transactions as itemMatrix in sparse format with
## 3898 rows (elements/itemsets/transactions) and
## 167 columns (items) and a density of 0.05340678
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 1786 1468 1363 1222
## yogurt (Other)
## 1103 27824
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 6 248 87 331 261 381 303 332 340 296 276 238 181 179 123 97 66 46 39 28
## 21 22 23 24 25 26
## 15 13 3 5 2 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 8.500 8.919 12.000 26.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
##
## includes extended transaction information - examples:
## transactionID
## 1 1000
## 2 1001
## 3 1002
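The dataset is now in transaction form, which is required to mine association rules with arules. Because the items were grouped by Member_number, each of the 3898 transactions is the set of distinct items one member bought over the whole period, not a single shopping trip.
Item frequency shows which items occur most often in the dataset.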
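As a minimal sketch of what this conversion does, the base R snippet below groups a small made-up purchase log by member and drops repeated items, which mirrors the "removing duplicated items" warning. The data frame `toy` and its values are hypothetical; only the column names follow the dataset.

```r
# Hypothetical purchase log with the same column names as the dataset
toy <- data.frame(
  Member_number   = c(1, 1, 1, 2, 2, 3),
  itemDescription = c("whole milk", "yogurt", "whole milk",
                      "rolls/buns", "whole milk", "soda"),
  stringsAsFactors = FALSE
)

# One character vector of items per member
baskets <- split(toy$itemDescription, toy$Member_number)

# A transaction is a set, so repeated purchases collapse to one entry
baskets <- lapply(baskets, unique)

print(baskets)
lengths(baskets)
```

Grouping by a different key, for example `paste(Member_number, Date)`, would give one basket per member and day instead.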
Frequency of the items:
itemFrequencyPlot(transactions, topN=15, type="absolute",ylim=c(0,2000), main="Item frequency", col="#56B4E9")
The graph shows that the most frequent items in the dataset are whole milk and other vegetables. The same graph can be drawn with relative frequencies:
itemFrequencyPlot(transactions, topN=15, type="relative",ylim=c(0,0.5), main="Item frequency", col="#56B4E9")
The relative frequency plot shows that whole milk is the most popular item: it appears in about 46% of the transactions (1786 of 3898).
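The relative frequencies can be checked by hand from the transaction summary, since each bar is an item count divided by the 3898 transactions. A small sketch, using only the five counts printed by summary(transactions):

```r
# Counts copied from the summary(transactions) output above
n_transactions <- 3898
item_counts <- c("whole milk" = 1786, "other vegetables" = 1468,
                 "rolls/buns" = 1363, "soda" = 1222, "yogurt" = 1103)

# Relative frequency = share of transactions that contain the item
rel_freq <- item_counts / n_transactions
round(rel_freq, 3)
```

Whole milk comes out at roughly 0.458, which is the height of its bar in the relative plot.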
# Apriori
Apriori is an algorithm for mining association rules from transaction data. Its main thresholds are minimum support and minimum confidence; lift is not a threshold but a quality measure reported for each rule. The thresholds must be chosen with care: if they are set too high, no rules are found, and if they are set too low, the algorithm returns a large number of weak rules.
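To make the three measures concrete, the sketch below computes them in base R on a handful of made-up baskets. Support is the share of transactions that contain both sides of a rule, confidence is that support divided by the support of the left-hand side, and lift is confidence divided by the support of the right-hand side. The `baskets` list and the helper names `support` and `rule_measures` are illustrative assumptions, not functions from arules.

```r
# Hypothetical baskets, one character vector per transaction
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "rolls/buns"),
  c("yogurt", "whole milk", "soda"),
  c("soda"),
  c("rolls/buns", "yogurt")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Quality measures for the rule lhs => rhs (illustrative helper, not part of arules)
rule_measures <- function(lhs, rhs, baskets) {
  supp_rule <- support(c(lhs, rhs), baskets)
  conf      <- supp_rule / support(lhs, baskets)
  c(support    = supp_rule,
    confidence = conf,
    lift       = conf / support(rhs, baskets))
}

m <- rule_measures("yogurt", "whole milk", baskets)
print(m)
```

A lift above 1 means the two sides occur together more often than they would if they were independent; a lift of exactly 1 means no association.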
rules <- apriori(transactions)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 389
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [29 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
With the default thresholds (support 0.1, confidence 0.8) the algorithm finds no rules, so the defaults need to be changed.
After several attempts with Apriori, I kept the minimum support at 0.1 and lowered the minimum confidence to 0.5.
rules <- apriori(transactions,parameter=list(supp=0.1, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 389
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [29 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 5 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1070 Min. :0.5082 Min. :0.2060 Min. :1.109
## 1st Qu.:0.1124 1st Qu.:0.5106 1st Qu.:0.2137 1st Qu.:1.114
## Median :0.1506 Median :0.5193 Median :0.2830 Median :1.133
## Mean :0.1480 Mean :0.5192 Mean :0.2858 Mean :1.133
## 3rd Qu.:0.1786 3rd Qu.:0.5258 3rd Qu.:0.3497 3rd Qu.:1.148
## Max. :0.1914 Max. :0.5322 Max. :0.3766 Max. :1.162
## count
## Min. :417.0
## 1st Qu.:438.0
## Median :587.0
## Mean :576.8
## 3rd Qu.:696.0
## Max. :746.0
##
## mining info:
## data ntransactions support confidence
## transactions 3898 0.1 0.5
## call
## apriori(data = transactions, parameter = list(supp = 0.1, conf = 0.5))
The algorithm now finds 5 rules, each with one item on the left-hand side and one on the right-hand side.
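One way to organise the search for thresholds is to count how many rules survive at several confidence levels. The sketch below does this on a small made-up set of baskets; the `toy` data and the grid of thresholds are assumptions for illustration, and the same loop could be pointed at `transactions`.

```r
library(arules)

# Hypothetical toy baskets; the report's `transactions` object could be used instead
toy <- as(list(
  c("whole milk", "yogurt"),
  c("whole milk", "rolls/buns"),
  c("whole milk", "yogurt", "soda"),
  c("soda", "rolls/buns"),
  c("whole milk", "rolls/buns", "yogurt")
), "transactions")

# Count the rules found at each confidence threshold, with support held at 0.1
conf_grid <- c(0.9, 0.7, 0.5, 0.3)
n_rules <- sapply(conf_grid, function(cf) {
  length(apriori(toy,
                 parameter = list(supp = 0.1, conf = cf, minlen = 2),
                 control = list(verbose = FALSE)))
})

data.frame(confidence = conf_grid, rules = n_rules)
```

Lowering the confidence threshold can only keep or add rules, so the counts never decrease down the table.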
set.seed(240)
plot(rules, measure=c("support","lift"), shading="confidence", main="Groceries Transactions Rules")
The scatter plot shows support against lift for the groceries transaction rules, with colour indicating confidence. The reddest dots are the rules with the highest confidence.
plot(rules, method="graph", measure="support", shading="lift")
The graph makes it evident that whole milk is the right-hand side of all five rules and is the most frequently purchased item.
Now, support, confidence and lift levels can be inspected for a better understanding and visualization.
inspect(sort(rules, by = "support", decreasing = TRUE), linebreak = FALSE)
## lhs rhs support confidence coverage lift
## [1] {other vegetables} => {whole milk} 0.1913802 0.5081744 0.3766034 1.109106
## [2] {rolls/buns} => {whole milk} 0.1785531 0.5106383 0.3496665 1.114484
## [3] {yogurt} => {whole milk} 0.1505900 0.5321850 0.2829656 1.161510
## [4] {bottled water} => {whole milk} 0.1123653 0.5258103 0.2136993 1.147597
## [5] {sausage} => {whole milk} 0.1069779 0.5193026 0.2060031 1.133394
## count
## [1] 746
## [2] 696
## [3] 587
## [4] 438
## [5] 417
inspect(sort(rules, by = "confidence", decreasing = TRUE), linebreak = FALSE)
## lhs rhs support confidence coverage lift
## [1] {yogurt} => {whole milk} 0.1505900 0.5321850 0.2829656 1.161510
## [2] {bottled water} => {whole milk} 0.1123653 0.5258103 0.2136993 1.147597
## [3] {sausage} => {whole milk} 0.1069779 0.5193026 0.2060031 1.133394
## [4] {rolls/buns} => {whole milk} 0.1785531 0.5106383 0.3496665 1.114484
## [5] {other vegetables} => {whole milk} 0.1913802 0.5081744 0.3766034 1.109106
## count
## [1] 587
## [2] 438
## [3] 417
## [4] 696
## [5] 746
inspect(sort(rules, by = "lift", decreasing = TRUE), linebreak = FALSE)
## lhs rhs support confidence coverage lift
## [1] {yogurt} => {whole milk} 0.1505900 0.5321850 0.2829656 1.161510
## [2] {bottled water} => {whole milk} 0.1123653 0.5258103 0.2136993 1.147597
## [3] {sausage} => {whole milk} 0.1069779 0.5193026 0.2060031 1.133394
## [4] {rolls/buns} => {whole milk} 0.1785531 0.5106383 0.3496665 1.114484
## [5] {other vegetables} => {whole milk} 0.1913802 0.5081744 0.3766034 1.109106
## count
## [1] 587
## [2] 438
## [3] 417
## [4] 696
## [5] 746
plot(rules, method="paracoord", control=list(reorder=TRUE))
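This plot, together with the tables above, shows that yogurt has the strongest link to whole milk: the rule {yogurt} => {whole milk} has the highest confidence (0.53) and lift (1.16), so members who buy yogurt are somewhat more likely than average to buy whole milk too.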
rules_yogurt <- apriori(data = transactions, parameter = list(supp = 0.01, conf = 0.005, minlen = 2), appearance = list(default = "rhs", lhs = "yogurt"), control = list(verbose = FALSE))
rules_yogurt_byconf <- sort(rules_yogurt, by = "confidence", decreasing = TRUE)
inspect(rules_yogurt_byconf[1:2], linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {yogurt} => {whole milk} 0.1505900 0.532185 0.2829656 1.16151 587
## [2] {yogurt} => {other vegetables} 0.1203181 0.425204 0.2829656 1.12905 469
Among members who buy yogurt, 53.2% also buy whole milk and 42.5% also buy other vegetables, so other vegetables is the second most likely companion item after whole milk.
rules_whole_milk <- apriori(data = transactions, parameter = list(supp = 0.01, conf = 0.005, minlen = 2), appearance = list(default = "rhs", lhs = "whole milk"), control = list(verbose = FALSE))
rules_whole_milk_byconf <- sort(rules_whole_milk, by = "confidence", decreasing = TRUE)
inspect(rules_whole_milk_byconf[1:2], linebreak = FALSE)
## lhs rhs support confidence coverage lift
## [1] {whole milk} => {other vegetables} 0.1913802 0.4176932 0.4581837 1.109106
## [2] {whole milk} => {rolls/buns} 0.1785531 0.3896976 0.4581837 1.114484
## count
## [1] 746
## [2] 696
Among members who buy whole milk, 41.8% also buy other vegetables and 39.0% also buy rolls/buns, which makes these the two most likely companion items.
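Confidence depends on the direction of a rule, while lift does not. The sketch below recomputes both directions of the whole milk and other vegetables rule from counts printed in the outputs above; nothing is assumed beyond those four numbers.

```r
# Counts taken from the outputs above
n      <- 3898   # transactions
n_milk <- 1786   # contain whole milk
n_veg  <- 1468   # contain other vegetables
n_both <- 746    # contain both

conf_milk_to_veg <- n_both / n_milk   # {whole milk} => {other vegetables}
conf_veg_to_milk <- n_both / n_veg    # {other vegetables} => {whole milk}

# Lift is symmetric: the same value for either direction
lift <- (n_both / n) / ((n_milk / n) * (n_veg / n))

round(c(conf_milk_to_veg = conf_milk_to_veg,
        conf_veg_to_milk = conf_veg_to_milk,
        lift = lift), 4)
```

The two confidences (about 0.418 and 0.508) differ because the denominators differ, whereas the lift of about 1.109 is the same in both inspect() tables.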
The analysis provided insights into customers' grocery purchasing patterns using association rules. The item frequency plots and the rules found by the Apriori algorithm both show that whole milk is the most popular item: it is the right-hand side of all 5 rules. The lift values, between 1.11 and 1.16, indicate that these associations are positive but modest. The methods used in this project can be applied to other transaction datasets too.