Introduction

This report explores association rules using the groceries dataset. The aim is to identify relationships between purchased items and uncover patterns using unsupervised learning techniques.

Literature Review

Several studies have applied association rules to grocery-related datasets. One of them is “Association Rules for Purchase Dependency of Grocery Items” by S. S. Radiah Shariff. The study identifies purchase dependence among grocery products in retail stores and supermarkets, with the goal of minimizing lost sales by building purchase dependence into the inventory model. Its association analysis, which examines the relationships between product categories, is based on primary data collected from 130 consumer sales transactions in retail stores and a supermarket. The inventory model is then simulated in Excel spreadsheets using the results of the association analysis. According to the simulation results, the extended inventory models had lower average overall inventory costs than the model that disregarded purchase dependencies.

Park and Seo (2013) examined purchase dependencies through the practice of purchasing spare parts in the inventory operations of the Hyundai Engine Europe Service Center (HEESC). Heitz-Spahn (2013) examined the effect of cross-channel free-riding on consumers’ purchasing behavior, with the aim of gaining more knowledge about free-riding in a multichannel retailing setting. In that study, 741 French respondents took part in an online survey about their choices of channels and shops, and about whether they were retention or free-riding consumers.

Dataset

The dataset used in this project is the Groceries dataset.

Link to the dataset -> https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset

library(arules)
## Warning: package 'arules' was built under R version 4.4.2
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 4.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
groceries <- read.csv("C://Users/Admin/Desktop/Groceries_dataset.csv")
head(groceries)
##   Member_number       Date  itemDescription
## 1          1808 21-07-2015   tropical fruit
## 2          2552 05-01-2015       whole milk
## 3          2300 19-09-2015        pip fruit
## 4          1187 12-12-2015 other vegetables
## 5          3037 01-02-2015       whole milk
## 6          4941 14-02-2015       rolls/buns

The head() function shows the first six rows of the dataset. Each row records one item bought by one member on one date.

summary(groceries)
##  Member_number      Date           itemDescription   
##  Min.   :1000   Length:38765       Length:38765      
##  1st Qu.:2002   Class :character   Class :character  
##  Median :3005   Mode  :character   Mode  :character  
##  Mean   :3004                                        
##  3rd Qu.:4007                                        
##  Max.   :5000

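The summary() function gives an overview of the dataset: it has 38765 rows, member numbers range from 1000 to 5000, and both Date and itemDescription are stored as character columns.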

Data Preprocessing

Data preprocessing is an essential step in any unsupervised learning project: the data must be well structured before any analysis is run.

Convert transactions to a suitable format:

groceries_trans <- split(groceries$itemDescription, groceries$Member_number)
transactions <- as(groceries_trans, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
summary(transactions)
## transactions as itemMatrix in sparse format with
##  3898 rows (elements/itemsets/transactions) and
##  167 columns (items) and a density of 0.05340678 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             1786             1468             1363             1222 
##           yogurt          (Other) 
##             1103            27824 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   6 248  87 331 261 381 303 332 340 296 276 238 181 179 123  97  66  46  39  28 
##  21  22  23  24  25  26 
##  15  13   3   5   2   2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   8.500   8.919  12.000  26.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
## 
## includes extended transaction information - examples:
##   transactionID
## 1          1000
## 2          1001
## 3          1002

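The dataset is now in transactions form, which the arules functions require for mining association rules. Because the items were split by Member_number, each of the 3898 transactions holds every distinct item a member bought over the whole period (8.9 items on average, out of 167 different items), not a single shopping trip.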
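The split() call above groups purchases by Member_number only. A possible alternative, shown here as a minimal sketch and not as the method used in this report, is to treat each member and date combination as one basket. The toy data frame below is made up and only mimics the three-column layout of the dataset; the names toy, by_member and by_visit are hypothetical.

```r
# Toy data in the same three-column layout as the report (values are made up).
toy <- data.frame(
  Member_number   = c(1, 1, 1, 2, 2),
  Date            = c("01-01-2015", "01-01-2015", "05-02-2015",
                      "01-01-2015", "01-01-2015"),
  itemDescription = c("whole milk", "yogurt", "whole milk", "soda", "soda"),
  stringsAsFactors = FALSE
)

# One basket per member: all visits of a member are merged.
# unique() drops repeated items, which as() would otherwise remove with a warning.
by_member <- lapply(split(toy$itemDescription, toy$Member_number), unique)

# One basket per member and date: each shopping trip is kept separate.
visit_id <- paste(toy$Member_number, toy$Date, sep = "_")
by_visit <- lapply(split(toy$itemDescription, visit_id), unique)

length(by_member)  # number of member-level baskets
length(by_visit)   # number of visit-level baskets
```

Either list can be passed to as(x, "transactions"). Visit-level baskets are smaller, so the support values of item pairs would generally be lower than in the member-level analysis.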

Item Frequency

Item frequency shows which items occur most often in the dataset.

Frequency of the items:

itemFrequencyPlot(transactions, topN=15, type="absolute",ylim=c(0,2000), main="Item frequency", col="#56B4E9") 

From the graph above, the most frequent items in the dataset are whole milk and other vegetables. The same graph can be shown with relative frequencies:

itemFrequencyPlot(transactions, topN=15, type="relative",ylim=c(0,0.5), main="Item frequency", col="#56B4E9") 

From the relative item frequency plot, whole milk is the most popular choice: it appears in about 46% of the transactions (1786 of 3898).
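The relative frequencies can be checked by hand from the counts printed by summary(transactions). The short sketch below only redoes that arithmetic; the variable names are hypothetical.

```r
# Absolute counts copied from the summary(transactions) output.
counts <- c("whole milk" = 1786, "other vegetables" = 1468,
            "rolls/buns" = 1363, "soda" = 1222, "yogurt" = 1103)
n_transactions <- 3898

# Relative frequency = share of transactions that contain the item.
rel_freq <- round(counts / n_transactions, 3)
rel_freq
```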

Apriori

Apriori is an algorithm that mines association rules from a transactions dataset. Its main parameters are the minimum support and the minimum confidence; lift is not a threshold but a quality measure reported for each rule that is found. The thresholds need to be balanced: if they are set too high, no rules are found, and if they are set too low, the algorithm returns many rules that are too rare or too unreliable to be useful.
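For reference, the three measures are usually defined as follows for a rule X => Y over a set of transactions T. The notation is the standard textbook one and is not taken from the arules output.

```latex
\[
\operatorname{supp}(X) = \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|}, \qquad
\operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}, \qquad
\operatorname{lift}(X \Rightarrow Y) = \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
\]
```

A lift of 1 means that X and Y appear together exactly as often as expected if they were independent; values above 1 indicate a positive association.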

rules <- apriori(transactions)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 389 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [29 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

With the default settings (minimum support 0.1 and minimum confidence 0.8), apriori finds no rules, so the defaults have to be changed.

After several attempts, I kept the minimum support at 0.1 and lowered the minimum confidence to 0.5.
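One way to organise such attempts is to count the rules found at each candidate confidence threshold. The sketch below is an illustration under assumptions, not the procedure actually followed here: it uses a small made-up list of baskets (toy_trans) in place of the transactions object, and the grid of thresholds is arbitrary.

```r
library(arules)

# Toy baskets (made up) standing in for the report's `transactions` object.
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "rolls/buns"),
  c("whole milk", "yogurt", "soda"),
  c("soda", "rolls/buns"),
  c("whole milk", "soda")
)
toy_trans <- as(baskets, "transactions")

# Count the rules found at each minimum confidence, holding support fixed.
conf_grid <- c(0.8, 0.6, 0.5, 0.4)
n_rules <- sapply(conf_grid, function(cf) {
  length(apriori(toy_trans,
                 parameter = list(supp = 0.2, conf = cf, minlen = 2),
                 control = list(verbose = FALSE)))
})
data.frame(confidence = conf_grid, rules = n_rules)
```

Lowering the confidence threshold can only keep or increase the number of rules, so the table makes it easy to pick the highest threshold that still returns a workable number of rules.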

rules <- apriori(transactions,parameter=list(supp=0.1, conf=0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 389 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [29 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [5 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(rules)
## set of 5 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 
## 5 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1070   Min.   :0.5082   Min.   :0.2060   Min.   :1.109  
##  1st Qu.:0.1124   1st Qu.:0.5106   1st Qu.:0.2137   1st Qu.:1.114  
##  Median :0.1506   Median :0.5193   Median :0.2830   Median :1.133  
##  Mean   :0.1480   Mean   :0.5192   Mean   :0.2858   Mean   :1.133  
##  3rd Qu.:0.1786   3rd Qu.:0.5258   3rd Qu.:0.3497   3rd Qu.:1.148  
##  Max.   :0.1914   Max.   :0.5322   Max.   :0.3766   Max.   :1.162  
##      count      
##  Min.   :417.0  
##  1st Qu.:438.0  
##  Median :587.0  
##  Mean   :576.8  
##  3rd Qu.:696.0  
##  Max.   :746.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          3898     0.1        0.5
##                                                                    call
##  apriori(data = transactions, parameter = list(supp = 0.1, conf = 0.5))

The algorithm now finds 5 rules, each with one item on the left-hand side and one on the right-hand side.

set.seed(240)
plot(rules, measure=c("support","lift"), shading="confidence", main="Groceries Transactions Rules")

The graph plots each rule by its support and lift, with confidence shown by the color shading. The redder a dot is, the higher the confidence of the rule.

plot(rules, method="graph", measure="support", shading="lift")

From the graph, it is evident that whole milk is the right-hand side of all five rules and is the most frequently purchased item.

Inspect

Now, support, confidence and lift levels can be inspected for a better understanding and visualization.

Support

inspect(sort(rules, by = "support", decreasing = TRUE), linebreak = FALSE)
##     lhs                   rhs          support   confidence coverage  lift    
## [1] {other vegetables} => {whole milk} 0.1913802 0.5081744  0.3766034 1.109106
## [2] {rolls/buns}       => {whole milk} 0.1785531 0.5106383  0.3496665 1.114484
## [3] {yogurt}           => {whole milk} 0.1505900 0.5321850  0.2829656 1.161510
## [4] {bottled water}    => {whole milk} 0.1123653 0.5258103  0.2136993 1.147597
## [5] {sausage}          => {whole milk} 0.1069779 0.5193026  0.2060031 1.133394
##     count
## [1] 746  
## [2] 696  
## [3] 587  
## [4] 438  
## [5] 417

Confidence

inspect(sort(rules, by = "confidence", decreasing = TRUE), linebreak = FALSE)
##     lhs                   rhs          support   confidence coverage  lift    
## [1] {yogurt}           => {whole milk} 0.1505900 0.5321850  0.2829656 1.161510
## [2] {bottled water}    => {whole milk} 0.1123653 0.5258103  0.2136993 1.147597
## [3] {sausage}          => {whole milk} 0.1069779 0.5193026  0.2060031 1.133394
## [4] {rolls/buns}       => {whole milk} 0.1785531 0.5106383  0.3496665 1.114484
## [5] {other vegetables} => {whole milk} 0.1913802 0.5081744  0.3766034 1.109106
##     count
## [1] 587  
## [2] 438  
## [3] 417  
## [4] 696  
## [5] 746

Lift

inspect(sort(rules, by = "lift", decreasing = TRUE), linebreak = FALSE)
##     lhs                   rhs          support   confidence coverage  lift    
## [1] {yogurt}           => {whole milk} 0.1505900 0.5321850  0.2829656 1.161510
## [2] {bottled water}    => {whole milk} 0.1123653 0.5258103  0.2136993 1.147597
## [3] {sausage}          => {whole milk} 0.1069779 0.5193026  0.2060031 1.133394
## [4] {rolls/buns}       => {whole milk} 0.1785531 0.5106383  0.3496665 1.114484
## [5] {other vegetables} => {whole milk} 0.1913802 0.5081744  0.3766034 1.109106
##     count
## [1] 587  
## [2] 438  
## [3] 417  
## [4] 696  
## [5] 746
plot(rules, method="paracoord", control=list(reorder=TRUE))

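This plot shows that yogurt has the strongest link to whole milk: among the five rules, {yogurt} => {whole milk} has the highest confidence (0.53) and lift (1.16), so more than half of the members who bought yogurt also bought whole milk.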
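The quality measures of the {yogurt} => {whole milk} rule can be reproduced from the counts reported earlier (1103 transactions with yogurt, 1786 with whole milk, 587 with both, 3898 in total). The sketch below is only a hand check of the inspect() output; the variable names are hypothetical.

```r
n_total  <- 3898  # number of transactions
n_yogurt <- 1103  # transactions containing yogurt
n_milk   <- 1786  # transactions containing whole milk
n_both   <- 587   # transactions containing both (the rule's count)

support    <- n_both / n_total                 # share of all transactions
confidence <- n_both / n_yogurt                # share of yogurt buyers
lift       <- confidence / (n_milk / n_total)  # confidence relative to milk's base rate

round(c(support = support, confidence = confidence, lift = lift), 4)
```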

rules_yogurt <- apriori(data = transactions, parameter = list(supp = 0.01, conf = 0.005, minlen = 2), appearance = list(default = "rhs", lhs = "yogurt"), control = list(verbose = FALSE))
rules_yogurt_byconf <- sort(rules_yogurt, by = "confidence", decreasing = TRUE)
inspect(rules_yogurt_byconf[1:2], linebreak = FALSE)
##     lhs         rhs                support   confidence coverage  lift    count
## [1] {yogurt} => {whole milk}       0.1505900 0.532185   0.2829656 1.16151 587  
## [2] {yogurt} => {other vegetables} 0.1203181 0.425204   0.2829656 1.12905 469

This means that, after whole milk (confidence 0.53), other vegetables are the item most likely to be bought by members who buy yogurt (confidence 0.43).

rules_whole_milk <- apriori(data = transactions, parameter = list(supp = 0.01, conf = 0.005, minlen = 2), appearance = list(default = "rhs", lhs = "whole milk"), control = list(verbose = FALSE))
rules_whole_milk_byconf <- sort(rules_whole_milk, by = "confidence", decreasing = TRUE)
inspect(rules_whole_milk_byconf[1:2], linebreak = FALSE)
##     lhs             rhs                support   confidence coverage  lift    
## [1] {whole milk} => {other vegetables} 0.1913802 0.4176932  0.4581837 1.109106
## [2] {whole milk} => {rolls/buns}       0.1785531 0.3896976  0.4581837 1.114484
##     count
## [1] 746  
## [2] 696

About 42% of the members who bought whole milk also bought other vegetables, and about 39% bought rolls/buns. With lift values of only about 1.11, both associations are weak.
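Comparing this output with the earlier {other vegetables} => {whole milk} rule shows that confidence depends on the direction of a rule while lift does not. The sketch below recomputes both directions from the reported counts; it is a hand check with hypothetical variable names.

```r
n_total <- 3898  # number of transactions
n_milk  <- 1786  # transactions containing whole milk
n_veg   <- 1468  # transactions containing other vegetables
n_both  <- 746   # transactions containing both

# Confidence changes with the direction of the rule.
conf_veg_to_milk <- n_both / n_veg
conf_milk_to_veg <- n_both / n_milk

# Lift is the same in both directions.
lift <- (n_both / n_total) / ((n_veg / n_total) * (n_milk / n_total))

round(c(veg_to_milk = conf_veg_to_milk,
        milk_to_veg = conf_milk_to_veg,
        lift = lift), 4)
```

The higher confidence of the rule pointing to whole milk mostly reflects that whole milk is the more common item, which is why lift is the more informative measure here.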

Conclusion

The analysis provided insights into customers’ grocery purchasing patterns using association rules. The item frequency plots and the rules found by the apriori algorithm showed that whole milk is the most popular item: it is the right-hand side of all 5 rules. The lift values, between 1.11 and 1.16, are only slightly above 1, so these associations are modest and largely reflect how common whole milk is. The methods used in this project can be applied to other datasets too.