Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a representation of the stuff that went into a customer’s basket, hence the name ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction, and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, along with your top 10 rules by lift.

Libraries Needed

library(arules)
library(arulesViz)
library(cluster)
library(factoextra)
transaction <- read.transactions('https://raw.githubusercontent.com/Jlok17/2022MSDS/main/Source/Data%20624/GroceryDataSet.csv', sep = ',', header = FALSE)
head(transaction)
## transactions in sparse format with
##  6 transactions (rows) and
##  169 items (columns)
str(transaction)
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
##   .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
##   .. .. ..@ Dim     : int [1:2] 169 9835
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  169 obs. of  1 variable:
##   .. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
##   ..@ itemsetInfo:'data.frame':  0 obs. of  0 variables

Rule Creation

To create the rules, I set minimum thresholds that a rule must meet. I use a minimum support of 0.002 because I want each item set to appear in at least 0.2% of all transactions (roughly 20 of the 9835 receipts). This filters out very rare item combinations, which are too sparse to be reliable, while still keeping the less common ones. I use a minimum confidence of 0.5 so that the consequent appears in at least 50% of the transactions that contain the antecedent, a balance between being overly cautious and too lenient.
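
To make the two thresholds concrete, here is a minimal base R sketch that computes support and confidence by hand for one rule. The five toy baskets and the helper `support_of()` are invented for illustration and are not part of the Groceries data or of arules.

```r
# Toy data, invented for illustration only (not from the Groceries file).
toy <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk"),
  c("bread", "butter", "jam")
)

# Hypothetical helper: fraction of transactions that contain every item in `items`.
support_of <- function(items, transactions) {
  mean(vapply(transactions, function(basket) all(items %in% basket), logical(1)))
}

lhs <- "bread"   # antecedent
rhs <- "butter"  # consequent

# Support of the rule: share of all baskets that contain both sides.
supp_rule <- support_of(c(lhs, rhs), toy)

# Confidence: share of baskets with the antecedent that also contain the consequent.
conf_rule <- supp_rule / support_of(lhs, toy)

# A rule is kept only if it clears both thresholds used in this report.
keep <- supp_rule >= 0.002 && conf_rule >= 0.5
```

In this toy set, bread and butter appear together in 3 of 5 baskets and bread appears in 4 of 5, so the rule clears both thresholds. On real data the support threshold does most of the pruning, because most item combinations are rare.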

# Generating rules
rules <- apriori(transaction, parameter = list(supp = 0.002, conf = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.002      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 19 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [1098 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Top 10 Sorted by lift 
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_sorted, 10))
##      lhs                     rhs                      support confidence    coverage     lift count
## [1]  {butter,                                                                                      
##       hard cheese}        => {whipped/sour cream} 0.002033554  0.5128205 0.003965430 7.154028    20
## [2]  {beef,                                                                                        
##       citrus fruit,                                                                                
##       other vegetables}   => {root vegetables}    0.002135231  0.6363636 0.003355363 5.838280    21
## [3]  {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       tropical fruit,                                                                              
##       whole milk}         => {root vegetables}    0.003152008  0.6326531 0.004982206 5.804238    31
## [4]  {citrus fruit,                                                                                
##       frozen vegetables,                                                                           
##       other vegetables}   => {root vegetables}    0.002033554  0.6250000 0.003253686 5.734025    20
## [5]  {beef,                                                                                        
##       other vegetables,                                                                            
##       tropical fruit}     => {root vegetables}    0.002745297  0.6136364 0.004473818 5.629770    27
## [6]  {bottled water,                                                                               
##       root vegetables,                                                                             
##       yogurt}             => {tropical fruit}     0.002236909  0.5789474 0.003863752 5.517391    22
## [7]  {herbs,                                                                                       
##       other vegetables,                                                                            
##       whole milk}         => {root vegetables}    0.002440264  0.6000000 0.004067107 5.504664    24
## [8]  {grapes,                                                                                      
##       pip fruit}          => {tropical fruit}     0.002135231  0.5675676 0.003762074 5.408941    21
## [9]  {herbs,                                                                                       
##       yogurt}             => {root vegetables}    0.002033554  0.5714286 0.003558719 5.242537    20
## [10] {beef,                                                                                        
##       other vegetables,                                                                            
##       soda}               => {root vegetables}    0.002033554  0.5714286 0.003558719 5.242537    20
plot(rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

# Item frequency plot for the 20 most frequent items
itemFrequencyPlot(transaction, topN = 20)

Briefly, lift measures the strength of the association in a rule. A lift greater than 1 means that the antecedent and consequent occur together more often than they would if they were independent, which indicates a positive relationship between the two. A lift equal to 1 suggests that the antecedent and consequent are independent of each other. A lift less than 1 means that they occur together less often than expected, which may indicate that the items are substitutes. For Rule 1, “butter, hard cheese” leads to “whipped/sour cream” with a lift of 7.154. This means that customers who buy butter and hard cheese are about 7.154 times as likely to buy whipped/sour cream as a customer picked at random. This information can be used to design marketing strategies such as coupons and promotions, or even a store layout that places these items far from each other so that the customer has to travel through more of the store, which could generate more revenue.
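
As a quick check on that reading of lift, the sketch below recomputes the Rule 1 figures from the values printed by inspect() above. It assumes the usual definitions (confidence = support / coverage, lift = confidence / support of the consequent); the variable names are my own.

```r
# Values copied from rule [1] in the inspect() output above.
support    <- 0.002033554  # share of baskets with butter, hard cheese and whipped/sour cream
coverage   <- 0.003965430  # share of baskets with butter and hard cheese
confidence <- 0.5128205
lift       <- 7.154028

# Confidence is the rule support divided by the support of the antecedent.
conf_check <- support / coverage

# Lift is confidence divided by the baseline support of the consequent,
# so that baseline can be backed out from the two reported numbers.
rhs_support <- confidence / lift

round(conf_check, 4)   # 0.5128
round(rhs_support, 4)  # 0.0717
```

So about 7% of all baskets contain whipped/sour cream, compared with about 51% of the baskets that contain butter and hard cheese, and the ratio of those two shares is the lift of roughly 7.15.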

Clustering

I used factoextra for clustering, but it did not seem to yield any good results. This will continue to be a work in progress for me.

# Hierarchical Clustering
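# Converting transactions to a sparse binary matrix (items in rows, transactions in columns)
itemMatrix <- as(transaction, "ngCMatrix")

# Transpose the itemMatrix so that transactions are in rows
transposedMatrix <- t(itemMatrix)

# Computing Euclidean distance between rows; after the transpose these are the 9835 transactions, not the 169 items
dist_items <- dist(transposedMatrix, method = "euclidean")

# Hierarchical clustering (hclust defaults to complete linkage)
hclust_items <- hclust(dist_items)
plot(hclust_items)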

# Cutting Tree to Clusters
clusters <- cutree(hclust_items, k = 5)
fviz_dend(hclust_items, k = 5, rect = TRUE, cex = 0.5)
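
One possible direction for the next iteration, offered only as a sketch: cluster the items rather than the transactions, and swap the Euclidean distance for the Jaccard distance, which ignores the many baskets in which neither item appears. This is not what the code above does. The example uses base R only, and the six toy baskets are invented for illustration.

```r
# Toy baskets, invented for illustration only (not from the Groceries file).
toy <- list(
  c("bread", "butter"),
  c("bread", "butter", "jam"),
  c("bread", "butter"),
  c("beer", "chips"),
  c("beer", "chips", "salsa"),
  c("beer", "chips")
)

items <- sort(unique(unlist(toy)))

# 0/1 matrix with items in rows and baskets in columns.
item_by_basket <- sapply(toy, function(basket) items %in% basket) * 1
rownames(item_by_basket) <- items

# method = "binary" in stats::dist() gives the Jaccard distance between rows:
# baskets containing exactly one of the two items, divided by baskets containing at least one.
d <- dist(item_by_basket, method = "binary")

# Average linkage is an arbitrary choice for this sketch.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)

# List the items in each cluster.
split(names(groups), groups)
```

On this toy data the breakfast items and the snack items fall into separate clusters. Whether the same approach gives readable clusters on the 169 grocery items would need to be checked on the real data.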