Association rules - grocery dataset

Agenda

Introduction: A brief overview of Association Rule Mining.
Dataset loading and overview: Loading the dataset and providing an overview of its structure and contents.
Preprocessing: Cleaning and preparing the data for analysis.
Analysis and Interpretation of Results: Calculation of support, confidence, and lift metrics, generation of rules, visualization of results.
Conclusion: Summary of the findings from the analysis.

Introduction

Association rule mining is a data mining technique for uncovering relationships between items that occur together in a dataset. In a grocery dataset, it can identify frequently purchased items and the associations between them. This information can then inform sales and marketing strategies such as product bundles, cross-selling, and up-selling. We will load the data, preprocess it, generate rules, visualize them, and finally analyze and interpret the findings. The ultimate goal of such an analysis is to provide recommendations that can be used to develop marketing strategies.

Dataset link: https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset

Dataset loading and overview

First, let’s load the packages and the data, then take a look at what we’re working with.

# load the packages
library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)
library(arulesCBA)

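# format = "single": one (ID, item) pair per row; rows sharing a Member_number form one transaction
trans <- read.transactions("data/Groceries_dataset.csv", format = "single", sep = ",",
                           cols = c("Member_number", "itemDescription"), header = TRUE)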

inspect(trans[1:10])

##      items                        transactionID
## [1]  {canned beer,                             
##       hygiene articles,                        
##       misc. beverages,                         
##       pastry,                                  
##       pickled vegetables,                      
##       salty snack,                             
##       sausage,                                 
##       semi-finished bread,                     
##       soda,                                    
##       whole milk,                              
##       yogurt}                              1000
## [2]  {beef,                                    
##       curd,                                    
##       frankfurter,                             
##       rolls/buns,                              
##       sausage,                                 
##       soda,                                    
##       whipped/sour cream,                      
##       white bread,                             
##       whole milk}                          1001
## [3]  {butter,                                  
##       butter milk,                             
##       frozen vegetables,                       
##       other vegetables,                        
##       specialty chocolate,                     
##       sugar,                                   
##       tropical fruit,                          
##       whole milk}                          1002
## [4]  {dental care,                             
##       detergent,                               
##       frozen meals,                            
##       rolls/buns,                              
##       root vegetables,                         
##       sausage}                             1003
## [5]  {canned beer,                             
##       chocolate,                               
##       cling film/bags,                         
##       dish cleaner,                            
##       frozen fish,                             
##       hygiene articles,                        
##       other vegetables,                        
##       packaged fruit/vegetables,               
##       pastry,                                  
##       pip fruit,                               
##       red/blush wine,                          
##       rolls/buns,                              
##       root vegetables,                         
##       shopping bags,                           
##       tropical fruit,                          
##       whole milk}                          1004
## [6]  {margarine,                               
##       rolls/buns,                              
##       whipped/sour cream}                  1005
## [7]  {bottled beer,                            
##       bottled water,                           
##       chicken,                                 
##       chocolate,                               
##       flour,                                   
##       frankfurter,                             
##       rice,                                    
##       rolls/buns,                              
##       shopping bags,                           
##       skin care,                               
##       softener,                                
##       whole milk}                          1006
## [8]  {dessert,                                 
##       domestic eggs,                           
##       hamburger meat,                          
##       liquor (appetizer),                      
##       liver loaf,                              
##       photo/film,                              
##       root vegetables,                         
##       soda,                                    
##       tropical fruit,                          
##       white wine,                              
##       yogurt}                              1008
## [9]  {canned fish,                             
##       cocoa drinks,                            
##       herbs,                                   
##       ketchup,                                 
##       newspapers,                              
##       pastry,                                  
##       tropical fruit,                          
##       yogurt}                              1009
## [10] {bottled water,                           
##       candles,                                 
##       coffee,                                  
##       frankfurter,                             
##       kitchen towels,                          
##       pip fruit,                               
##       rolls/buns,                              
##       sliced cheese,                           
##       specialty bar,                           
##       UHT-milk}                            1010
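To make the input format concrete: with format = "single", every row of the CSV is one (ID, item) pair, and rows that share an ID are collected into one transaction. The base R sketch below mimics that grouping on a few made-up rows. It is only an illustration of the idea, not how arules implements it, and the toy values are assumptions.

```r
# Toy (member, item) rows; these values are made up for illustration.
rows <- data.frame(
  Member_number   = c(1000, 1000, 1001, 1000, 1001),
  itemDescription = c("soda", "yogurt", "beef", "soda", "curd"),
  stringsAsFactors = FALSE
)

# Group the items by member, then drop repeats:
# a transaction records whether an item is present, not how often it was bought.
baskets <- lapply(split(rows$itemDescription, rows$Member_number), unique)

print(baskets)
```

Because the ID column is Member_number, each basket holds everything recorded for that member.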

Each transaction here collects every item recorded for one Member_number, which is why the transactions are quite big and diverse. This could mean the analysis will be full of interesting insights. Let’s check the relative frequency of the items across transactions.

itemFrequencyPlot(trans, topN=15, type="relative", main="Grocery item frequency")

We see that whole milk is in over 40% of the transactions. Other vegetables, rolls/buns, soda and yogurt are also very frequent. This is expected, as these are everyday food and drink items that are bought often. Next, let’s look at the summary of the data.

summary(trans)

## transactions as itemMatrix in sparse format with
##  3898 rows (elements/itemsets/transactions) and
##  167 columns (items) and a density of 0.05340678 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             1786             1468             1363             1222 
##           yogurt          (Other) 
##             1103            27824 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   6 248  87 331 261 381 303 332 340 296 276 238 181 179 123  97  66  46  39  28 
##  21  22  23  24  25  26 
##  15  13   3   5   2   2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   8.500   8.919  12.000  26.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
## 
## includes extended transaction information - examples:
##   transactionID
## 1          1000
## 2          1001
## 3          1002

Here we can see the number of transactions containing each of the top items, along with some basic descriptive statistics: 3898 transactions, 167 distinct items, and a median of 8.5 items per transaction. Let’s move on to preprocessing.
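The figures in the summary are tied together by simple arithmetic, which can be checked by hand. The sketch below uses only numbers printed in the summary output; the variable names are my own.

```r
# Values copied from the summary(trans) output.
n_trans <- 3898
n_items <- 167
top_counts <- c("whole milk" = 1786, "other vegetables" = 1468,
                "rolls/buns" = 1363, "soda" = 1222, "yogurt" = 1103)
other_count <- 27824  # the "(Other)" entry

# Total number of (transaction, item) entries in the sparse matrix.
total_entries <- sum(top_counts) + other_count

# Density: share of filled cells in the transactions x items matrix.
density <- total_entries / (n_trans * n_items)

# Mean transaction size: filled cells per transaction.
mean_size <- total_entries / n_trans

# Relative frequency of each top item, as used by the frequency plot.
rel_freq <- top_counts / n_trans

print(density)
print(mean_size)
print(round(rel_freq, 3))
```

The recomputed density and mean size agree with the summary (0.0534 and 8.919), and whole milk comes out at about 46% of transactions.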

Preprocessing

This data doesn’t need any further preprocessing, as the read.transactions function already converted the raw rows into transactions, so let’s move on to the analysis.

Analysis and interpretation of results

rules <- apriori(trans)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 389 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [29 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

The Apriori run did not generate any rules. This does not mean that the data contains no patterns; it means that no rule passed the default thresholds shown in the parameter specification, a minimum support of 0.1 and a minimum confidence of 0.8. Support was not the only obstacle: 29 items cleared the support threshold (at least 389 transactions) and the algorithm checked itemsets of up to three items. What no candidate rule reached at that support level was the required confidence of 0.8. Let’s try lower support and confidence thresholds.

rules <- apriori(trans, parameter=list(supp=0.03, conf=0.60, minlen=2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.03      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 116 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[167 item(s), 3898 transaction(s)] done [0.00s].
## sorting and recoding items ... [72 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
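The effect of the two thresholds can be illustrated by hand. The sketch below takes the support and confidence of four rules reported in the tables later in this document and applies both sets of thresholds to them. The helper `passes` is a hypothetical name introduced here, and the filter only mimics the thresholding step, not the Apriori search itself.

```r
# Support and confidence copied from rule tables shown later in this document.
rules_df <- data.frame(
  rule = c("{whole milk} => {other vegetables}",
           "{whole milk} => {rolls/buns}",
           "{other vegetables, rolls/buns, yogurt} => {whole milk}",
           "{rolls/buns, shopping bags} => {whole milk}"),
  support    = c(0.1913802, 0.1785531, 0.0343766, 0.04130323),
  confidence = c(0.4176932, 0.3896976, 0.6568627, 0.6007463),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rules that meet both minimum thresholds.
passes <- function(df, supp, conf) {
  df[df$support >= supp & df$confidence >= conf, ]
}

default_hits <- passes(rules_df, supp = 0.1, conf = 0.8)    # apriori() defaults
relaxed_hits <- passes(rules_df, supp = 0.03, conf = 0.60)  # second run

print(nrow(default_hits))
print(relaxed_hits$rule)
```

The two milk-first rules have ample support but low confidence, while the two rules ending in whole milk have enough confidence for the relaxed run but too little support for the defaults, so none of the four passes the default thresholds.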

This time the algorithm generated 8 rules. Lowering the minimum support to 0.03 (at least 116 transactions) and the minimum confidence to 0.6 lets more itemsets and rules through, and minlen=2 excludes rules with an empty left-hand side. Now that we have some rules to work with, we can plot them.

set.seed(42) 
plot(rules, method="graph", measure="support", shading="lift", main="Grocery rules")

## Warning: Unknown control parameters: main

## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

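Here we can see the rules as a graph, with color showing lift and size showing support. The warning appears because the graph method does not accept a main argument, so the title is ignored. Let’s see some other plots as well.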

plot(rules, method="paracoord", control=list(reorder=TRUE))

plot(rules, shading="order", control=list(main="Two-key plot"))

In these plots every rule ends with whole milk, which is not surprising as it is the most common item in the transactions. We can also see that some rules have three items on the left-hand side. Let’s inspect the rules in table form.

By support

The support metric represents the frequency of occurrence of the lhs and rhs items together in the transactions. A higher support value indicates that the items occur more frequently together in transactions.

inspect(sort(rules, by = "support"), linebreak = FALSE)

##     lhs                                       rhs          support   
## [1] {rolls/buns, shopping bags}            => {whole milk} 0.04130323
## [2] {bottled water, yogurt}                => {whole milk} 0.04027707
## [3] {bottled beer, rolls/buns}             => {whole milk} 0.03822473
## [4] {pastry, yogurt}                       => {whole milk} 0.03488969
## [5] {other vegetables, rolls/buns, yogurt} => {whole milk} 0.03437660
## [6] {shopping bags, yogurt}                => {whole milk} 0.03309389
## [7] {other vegetables, rolls/buns, soda}   => {whole milk} 0.03181119
## [8] {beef, other vegetables}               => {whole milk} 0.03052848
##     confidence coverage   lift     count
## [1] 0.6007463  0.06875321 1.311147 161  
## [2] 0.6061776  0.06644433 1.323001 157  
## [3] 0.6056911  0.06310929 1.321939 149  
## [4] 0.6017699  0.05797845 1.313381 136  
## [5] 0.6568627  0.05233453 1.433623 134  
## [6] 0.6028037  0.05489995 1.315638 129  
## [7] 0.6048780  0.05259107 1.320165 124  
## [8] 0.6010101  0.05079528 1.311723 119
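The support column can be reproduced from the count column: support is the number of transactions containing all items of the rule, divided by the total number of transactions. A minimal check with the values from the table above (variable names are my own):

```r
n_trans <- 3898  # number of transactions, from summary(trans)

# count and support columns, in the order of the table sorted by support.
count   <- c(161, 157, 149, 136, 134, 129, 124, 119)
support <- c(0.04130323, 0.04027707, 0.03822473, 0.03488969,
             0.03437660, 0.03309389, 0.03181119, 0.03052848)

# support = count / number of transactions
support_check <- count / n_trans

print(round(support_check, 8))
print(max(abs(support_check - support)))
```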

By confidence

The confidence metric represents the proportion of transactions containing the lhs items that also contain the rhs item. A higher confidence value indicates a stronger relationship between the lhs and rhs items.

inspect(sort(rules, by = "confidence"), linebreak = FALSE)

##     lhs                                       rhs          support   
## [1] {other vegetables, rolls/buns, yogurt} => {whole milk} 0.03437660
## [2] {bottled water, yogurt}                => {whole milk} 0.04027707
## [3] {bottled beer, rolls/buns}             => {whole milk} 0.03822473
## [4] {other vegetables, rolls/buns, soda}   => {whole milk} 0.03181119
## [5] {shopping bags, yogurt}                => {whole milk} 0.03309389
## [6] {pastry, yogurt}                       => {whole milk} 0.03488969
## [7] {beef, other vegetables}               => {whole milk} 0.03052848
## [8] {rolls/buns, shopping bags}            => {whole milk} 0.04130323
##     confidence coverage   lift     count
## [1] 0.6568627  0.05233453 1.433623 134  
## [2] 0.6061776  0.06644433 1.323001 157  
## [3] 0.6056911  0.06310929 1.321939 149  
## [4] 0.6048780  0.05259107 1.320165 124  
## [5] 0.6028037  0.05489995 1.315638 129  
## [6] 0.6017699  0.05797845 1.313381 136  
## [7] 0.6010101  0.05079528 1.311723 119  
## [8] 0.6007463  0.06875321 1.311147 161
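Confidence follows from two other columns of the table: it is the support of the whole rule divided by the coverage, which is the support of the left-hand side alone. A short check using the printed values (variable names are my own):

```r
# support, coverage and confidence columns, in the order of the table sorted by confidence.
support    <- c(0.03437660, 0.04027707, 0.03822473, 0.03181119,
                0.03309389, 0.03488969, 0.03052848, 0.04130323)
coverage   <- c(0.05233453, 0.06644433, 0.06310929, 0.05259107,
                0.05489995, 0.05797845, 0.05079528, 0.06875321)
confidence <- c(0.6568627, 0.6061776, 0.6056911, 0.6048780,
                0.6028037, 0.6017699, 0.6010101, 0.6007463)

# confidence = support(lhs and rhs) / support(lhs)
confidence_check <- support / coverage

print(round(confidence_check, 7))
```

For the first rule this is 134 of the 204 transactions that contain other vegetables, rolls/buns and yogurt.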

By lift

The lift metric represents the strength of association between the lhs and rhs items, compared to their expected occurrence if they were independent of each other. A lift value greater than 1 indicates a positive association between the items, while a lift value less than 1 indicates a negative association.

inspect(sort(rules, by = "lift"), linebreak = FALSE)

##     lhs                                       rhs          support   
## [1] {other vegetables, rolls/buns, yogurt} => {whole milk} 0.03437660
## [2] {bottled water, yogurt}                => {whole milk} 0.04027707
## [3] {bottled beer, rolls/buns}             => {whole milk} 0.03822473
## [4] {other vegetables, rolls/buns, soda}   => {whole milk} 0.03181119
## [5] {shopping bags, yogurt}                => {whole milk} 0.03309389
## [6] {pastry, yogurt}                       => {whole milk} 0.03488969
## [7] {beef, other vegetables}               => {whole milk} 0.03052848
## [8] {rolls/buns, shopping bags}            => {whole milk} 0.04130323
##     confidence coverage   lift     count
## [1] 0.6568627  0.05233453 1.433623 134  
## [2] 0.6061776  0.06644433 1.323001 157  
## [3] 0.6056911  0.06310929 1.321939 149  
## [4] 0.6048780  0.05259107 1.320165 124  
## [5] 0.6028037  0.05489995 1.315638 129  
## [6] 0.6017699  0.05797845 1.313381 136  
## [7] 0.6010101  0.05079528 1.311723 119  
## [8] 0.6007463  0.06875321 1.311147 161

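These tables look as expected. Sorting by confidence and sorting by lift give the same order because every rule has the same right-hand side: lift is confidence divided by the support of whole milk, which is the same for all eight rules. All lifts fall between 1.31 and 1.43, so the associations are positive but moderate. To learn more, we should ask more specific questions of the data.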
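The lift column can be recomputed from the confidence column and the frequency of whole milk given in the summary (1786 of 3898 transactions). The sketch below uses only values printed above; the variable names are my own.

```r
# Support of the right-hand side, from the summary(trans) output.
milk_support <- 1786 / 3898

# confidence and lift columns, in the order of the table sorted by support.
confidence <- c(0.6007463, 0.6061776, 0.6056911, 0.6017699,
                0.6568627, 0.6028037, 0.6048780, 0.6010101)
lift       <- c(1.311147, 1.323001, 1.321939, 1.313381,
                1.433623, 1.315638, 1.320165, 1.311723)

# lift = confidence / support(rhs)
lift_check <- confidence / milk_support

print(round(lift_check, 6))

# With a shared right-hand side, lift is confidence times a constant,
# so both metrics rank the rules identically.
same_order <- identical(order(confidence), order(lift))
print(same_order)
```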

Some additional interesting analysis

If we know the items that lead to buying milk - then what does buying milk lead to?

rules.milk <- apriori(data = trans, parameter = list(supp = 0.001, conf = 0.05, minlen = 2),
                      appearance = list(default = "rhs", lhs = "whole milk"),
                      control = list(verbose = FALSE))

rules.milk.byconf<-sort(rules.milk, by="confidence", decreasing=TRUE)
inspect(head(rules.milk.byconf))

##     lhs             rhs                support   confidence coverage  lift    
## [1] {whole milk} => {other vegetables} 0.1913802 0.4176932  0.4581837 1.109106
## [2] {whole milk} => {rolls/buns}       0.1785531 0.3896976  0.4581837 1.114484
## [3] {whole milk} => {soda}             0.1511031 0.3297872  0.4581837 1.051973
## [4] {whole milk} => {yogurt}           0.1505900 0.3286674  0.4581837 1.161510
## [5] {whole milk} => {tropical fruit}   0.1164700 0.2541993  0.4581837 1.087672
## [6] {whole milk} => {root vegetables}  0.1131349 0.2469205  0.4581837 1.070630
##     count
## [1] 746  
## [2] 696  
## [3] 589  
## [4] 587  
## [5] 454  
## [6] 441

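Unsurprisingly, buying milk leads to the same popular items that appeared in the rules pointing to milk. The lifts are only slightly above 1 (1.05 to 1.16), so these rules mostly reflect how common the items are rather than a strong association with milk.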
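One reason the two directions look alike is that lift is symmetric: it depends only on how often the two items occur together and on their own frequencies, so {A} => {B} and {B} => {A} always have the same lift and only the confidence differs. The sketch below recomputes the lift of the first four milk rules from the counts in the table above and the item counts in the summary; the variable names are my own.

```r
n_trans    <- 3898
milk_count <- 1786

# Item counts from summary(trans) and joint counts from the rules.milk table.
item_counts  <- c("other vegetables" = 1468, "rolls/buns" = 1363,
                  "soda" = 1222, "yogurt" = 1103)
joint_counts <- c("other vegetables" = 746, "rolls/buns" = 696,
                  "soda" = 589, "yogurt" = 587)

# lift = P(A and B) / (P(A) * P(B)); swapping A and B leaves it unchanged.
lift <- (joint_counts / n_trans) /
  ((milk_count / n_trans) * (item_counts / n_trans))

# Confidence does depend on the direction of the rule.
conf_milk_to_item <- joint_counts / milk_count
conf_item_to_milk <- joint_counts / item_counts

print(round(lift, 6))
print(round(conf_milk_to_item, 4))
print(round(conf_item_to_milk, 4))
```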

Out of pure curiosity, what items lead to buying pet care products?

rules.pet <- apriori(data = trans, parameter = list(supp = 0.001, conf = 0.005, minlen = 2),
                     appearance = list(default = "lhs", rhs = "pet care"),
                     control = list(verbose = FALSE))

rules.pet.byconf<-sort(rules.pet, by="confidence", decreasing=TRUE)
inspect(head(rules.pet.byconf), linebreak = FALSE)

##     lhs                                                          rhs       
## [1] {baking powder, citrus fruit, other vegetables, pastry}   => {pet care}
## [2] {baking powder, citrus fruit, pastry}                     => {pet care}
## [3] {long life bakery product, pastry, rolls/buns, soda}      => {pet care}
## [4] {baking powder, other vegetables, pastry, whole milk}     => {pet care}
## [5] {citrus fruit, frankfurter, pastry, soda}                 => {pet care}
## [6] {brown bread, other vegetables, rolls/buns, soda, yogurt} => {pet care}
##     support     confidence coverage    lift     count
## [1] 0.001026167 0.6666667  0.001539251 30.57255 4    
## [2] 0.001026167 0.5714286  0.001795793 26.20504 4    
## [3] 0.001026167 0.3333333  0.003078502 15.28627 4    
## [4] 0.001026167 0.3076923  0.003335044 14.11041 4    
## [5] 0.001026167 0.3076923  0.003335044 14.11041 4    
## [6] 0.001282709 0.2941176  0.004361211 13.48789 5

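There appears to be an association between baking powder, citrus fruit and pastry on one side and pet care items on the other. However, each of these rules rests on only 4 or 5 transactions out of 3898, so the very high lift values are most likely an artifact of small counts, and the result should not be over-interpreted.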
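To see how few transactions stand behind these rules, the relative measures in the table can be converted back into absolute counts. The sketch below uses only values printed above; the variable names are my own.

```r
n_trans <- 3898

# coverage and count columns of the six pet care rules shown above.
coverage <- c(0.001539251, 0.001795793, 0.003078502,
              0.003335044, 0.003335044, 0.004361211)
count    <- c(4, 4, 4, 4, 4, 5)

# Smallest count that still reaches a support of 0.001.
min_count <- ceiling(0.001 * n_trans)

# Number of transactions that contain the left-hand side of each rule.
lhs_count <- round(coverage * n_trans)

# Support of pet care itself, recovered from lift = confidence / support(rhs).
pet_care_support <- 0.6666667 / 30.57255
pet_care_count   <- round(pet_care_support * n_trans)

print(min_count)
print(lhs_count)
print(round(count / lhs_count, 7))
print(pet_care_count)
```

The top rule is therefore 4 pet care buyers among just 6 transactions containing its left-hand side.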

What items are associated with instant coffee?

rules.coffee <- apriori(data = trans, parameter = list(supp = 0.001, conf = 0.005, minlen = 2),
                        appearance = list(default = "lhs", rhs = "instant coffee"),
                        control = list(verbose = FALSE))

rules.coffee.byconf<-sort(rules.coffee, by="confidence", decreasing=TRUE)
inspect(head(rules.coffee.byconf), linebreak = FALSE)

##     lhs                                                                           
## [1] {beef, butter milk, other vegetables, root vegetables}                        
## [2] {mayonnaise, rolls/buns, yogurt}                                              
## [3] {newspapers, other vegetables, rolls/buns, tropical fruit, yogurt}            
## [4] {long life bakery product, root vegetables, shopping bags}                    
## [5] {bottled water, frankfurter, other vegetables, rolls/buns, whole milk, yogurt}
## [6] {chewing gum, margarine, other vegetables}                                    
##        rhs              support     confidence coverage    lift     count
## [1] => {instant coffee} 0.001026167 0.5000000  0.002052335 33.03390 4    
## [2] => {instant coffee} 0.001026167 0.4000000  0.002565418 26.42712 4    
## [3] => {instant coffee} 0.001026167 0.3636364  0.002821960 24.02465 4    
## [4] => {instant coffee} 0.001026167 0.3076923  0.003335044 20.32855 4    
## [5] => {instant coffee} 0.001026167 0.3076923  0.003335044 20.32855 4    
## [6] => {instant coffee} 0.001026167 0.2857143  0.003591585 18.87651 4

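Newspapers and chewing gum show up on the left-hand side of some rules leading to instant coffee. Maybe these customers read the newspaper while drinking coffee and chew gum afterwards to get rid of coffee breath. As with pet care, each rule is based on just 4 transactions, so this is speculation rather than a finding.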

When people buy liquor, what do they also buy?

rules.liquor <- apriori(data = trans, parameter = list(supp = 0.001, conf = 0.005, minlen = 2),
                        appearance = list(default = "rhs", lhs = "liquor"),
                        control = list(verbose = FALSE))

rules.liquor.byconf<-sort(rules.liquor, by="confidence", decreasing=TRUE)
inspect(head(rules.liquor.byconf, 15), linebreak = FALSE)

##      lhs         rhs                support     confidence coverage   lift    
## [1]  {liquor} => {whole milk}       0.016675218 0.6310680  0.02642381 1.377325
## [2]  {liquor} => {other vegetables} 0.012827091 0.4854369  0.02642381 1.288987
## [3]  {liquor} => {rolls/buns}       0.011287840 0.4271845  0.02642381 1.221691
## [4]  {liquor} => {soda}             0.010774756 0.4077670  0.02642381 1.300717
## [5]  {liquor} => {yogurt}           0.010261673 0.3883495  0.02642381 1.372426
## [6]  {liquor} => {tropical fruit}   0.009235505 0.3495146  0.02642381 1.495508
## [7]  {liquor} => {root vegetables}  0.008209338 0.3106796  0.02642381 1.347085
## [8]  {liquor} => {sausage}          0.007439713 0.2815534  0.02642381 1.366744
## [9]  {liquor} => {bottled water}    0.007439713 0.2815534  0.02642381 1.317521
## [10] {liquor} => {shopping bags}    0.006926629 0.2621359  0.02642381 1.557631
## [11] {liquor} => {newspapers}       0.006157004 0.2330097  0.02642381 1.666554
## [12] {liquor} => {bottled beer}     0.005387378 0.2038835  0.02642381 1.283906
## [13] {liquor} => {citrus fruit}     0.005387378 0.2038835  0.02642381 1.099222
## [14] {liquor} => {chicken}          0.005130836 0.1941748  0.02642381 1.930850
## [15] {liquor} => {brown bread}      0.005130836 0.1941748  0.02642381 1.428100
##      count
## [1]  65   
## [2]  50   
## [3]  44   
## [4]  42   
## [5]  40   
## [6]  36   
## [7]  32   
## [8]  29   
## [9]  29   
## [10] 27   
## [11] 24   
## [12] 21   
## [13] 21   
## [14] 20   
## [15] 20

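These are mostly ordinary day-to-day products. Tropical and citrus fruit are interesting, though; maybe customers make cocktails with them? The lift for citrus fruit is only about 1.10, so that link is weak, while chicken (1.93) and newspapers (1.67) have the highest lift. Let’s ask one last interesting question.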
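Sorting by confidence puts the most popular items first, whatever is on the left-hand side. Ranking the same rules by lift gives a different picture. The sketch below re-sorts the values printed in the table above in base R; the data frame and its name are my own construction.

```r
# rhs, confidence and lift columns of the 15 liquor rules shown above.
liquor <- data.frame(
  rhs = c("whole milk", "other vegetables", "rolls/buns", "soda", "yogurt",
          "tropical fruit", "root vegetables", "sausage", "bottled water",
          "shopping bags", "newspapers", "bottled beer", "citrus fruit",
          "chicken", "brown bread"),
  confidence = c(0.6310680, 0.4854369, 0.4271845, 0.4077670, 0.3883495,
                 0.3495146, 0.3106796, 0.2815534, 0.2815534, 0.2621359,
                 0.2330097, 0.2038835, 0.2038835, 0.1941748, 0.1941748),
  lift = c(1.377325, 1.288987, 1.221691, 1.300717, 1.372426,
           1.495508, 1.347085, 1.366744, 1.317521, 1.557631,
           1.666554, 1.283906, 1.099222, 1.930850, 1.428100),
  stringsAsFactors = FALSE
)

# Rank by lift, highest first.
by_lift <- liquor[order(-liquor$lift), ]

print(head(by_lift, 5), row.names = FALSE)
```

On the rules object itself, the same ranking is what sort(rules.liquor, by = "lift") produces, as used for the main rule set earlier.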

When people buy frozen meals, what do they also buy?

rules.frozen <- apriori(data = trans, parameter = list(supp = 0.001, conf = 0.005, minlen = 2),
                        appearance = list(default = "rhs", lhs = "frozen meals"),
                        control = list(verbose = FALSE))

rules.frozen.byconf<-sort(rules.frozen, by="confidence", decreasing=TRUE)
inspect(head(rules.frozen.byconf, 20), linebreak = FALSE)

##      lhs               rhs                     support    confidence coverage  
## [1]  {frozen meals} => {whole milk}            0.03258081 0.5183673  0.06285274
## [2]  {frozen meals} => {rolls/buns}            0.02821960 0.4489796  0.06285274
## [3]  {frozen meals} => {other vegetables}      0.02770652 0.4408163  0.06285274
## [4]  {frozen meals} => {yogurt}                0.02077989 0.3306122  0.06285274
## [5]  {frozen meals} => {soda}                  0.02077989 0.3306122  0.06285274
## [6]  {frozen meals} => {tropical fruit}        0.01770139 0.2816327  0.06285274
## [7]  {frozen meals} => {sausage}               0.01590559 0.2530612  0.06285274
## [8]  {frozen meals} => {bottled water}         0.01539251 0.2448980  0.06285274
## [9]  {frozen meals} => {root vegetables}       0.01462288 0.2326531  0.06285274
## [10] {frozen meals} => {citrus fruit}          0.01436634 0.2285714  0.06285274
## [11] {frozen meals} => {canned beer}           0.01205747 0.1918367  0.06285274
## [12] {frozen meals} => {pip fruit}             0.01180092 0.1877551  0.06285274
## [13] {frozen meals} => {pastry}                0.01180092 0.1877551  0.06285274
## [14] {frozen meals} => {bottled beer}          0.01154438 0.1836735  0.06285274
## [15] {frozen meals} => {whipped/sour cream}    0.01128784 0.1795918  0.06285274
## [16] {frozen meals} => {brown bread}           0.01103130 0.1755102  0.06285274
## [17] {frozen meals} => {shopping bags}         0.01103130 0.1755102  0.06285274
## [18] {frozen meals} => {fruit/vegetable juice} 0.01077476 0.1714286  0.06285274
## [19] {frozen meals} => {newspapers}            0.01051821 0.1673469  0.06285274
## [20] {frozen meals} => {domestic eggs}         0.01026167 0.1632653  0.06285274
##      lift     count
## [1]  1.131353 127  
## [2]  1.284022 110  
## [3]  1.170505 108  
## [4]  1.168383  81  
## [5]  1.054604  81  
## [6]  1.205054  69  
## [7]  1.228434  62  
## [8]  1.145993  60  
## [9]  1.008767  57  
## [10] 1.232326  56  
## [11] 1.161148  47  
## [12] 1.100555  46  
## [13] 1.057615  46  
## [14] 1.156638  45  
## [15] 1.160944  44  
## [16] 1.290828  43  
## [17] 1.042894  43  
## [18] 1.372133  42  
## [19] 1.196914  41  
## [20] 1.226220  40

It seems they buy ordinary items as well, with lifts close to 1. Canned and bottled beer also appear in the list. Having answered these questions, we can conclude this work.

Conclusion

The analysis of the grocery dataset through association rules provided insights into the purchasing patterns of customers. At a minimum support of 0.03 and a minimum confidence of 0.6, all eight rules point to whole milk, with lifts between 1.31 and 1.43. The association rule mining technique also allowed us to answer more specific questions and uncover relationships between different items, and the plots made the findings easier to understand and interpret. However, the dataset has some limitations: some item groups are very similar to each other, which makes them hard to analyze properly, and the rules found at very low support rest on only a handful of transactions. Despite this, the analysis serves as a foundation for future studies and provides useful information for the optimization of retail operations.