Grocery Data Set

This data set contains one month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. It holds 9835 transactions, with the items aggregated into 169 categories. The following exercise uses market basket analysis to mine the top five association rules for the data set.

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
data("Groceries")
dim(Groceries)
## [1] 9835  169
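set.seed(1)  # not strictly needed: nothing below draws random numbers
Groceries <- as(Groceries, "transactions")  # no-op here: Groceries is already a transactions object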
Groceries
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)
summary(Groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

The data set summary lists the five most frequently purchased items. It also shows that most transactions contained four items or fewer (the median basket size is 3).
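
The "four items or fewer" reading can be checked against the size distribution printed by summary(). The sketch below re-enters those counts by hand; the vector names are illustrative and not part of the original analysis.

```r
# Basket-size counts copied from the summary(Groceries) output above.
# Sizes 25, 30 and 31 do not occur in the data, so they are skipped.
sizes  <- c(1:24, 26:29, 32)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350, 246, 182, 117,
            78, 77, 55, 46, 29, 14, 14, 9, 11, 4, 6, 1, 1, 1, 1, 3, 1)

n_transactions <- sum(counts)   # total number of baskets
share_up_to_4  <- sum(counts[sizes <= 4]) / n_transactions

n_transactions
round(share_up_to_4, 3)
```

Roughly 62% of baskets hold four items or fewer, so "most" is accurate but not overwhelming.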

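To show the more frequently purchased items, the next section of code first converts the transactions to an incidence matrix and prints its first two rows, then plots a bar chart of the items that appear in more than 6% of transactions.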

aa <- as(Groceries, "matrix") # transforms the transactions into an incidence matrix
aa[1:2,]   # print the first two rows of the incidence matrix
##      frankfurter sausage liver loaf   ham  meat finished products
## [1,]       FALSE   FALSE      FALSE FALSE FALSE             FALSE
## [2,]       FALSE   FALSE      FALSE FALSE FALSE             FALSE
##      organic sausage chicken turkey  pork  beef hamburger meat  fish
## [1,]           FALSE   FALSE  FALSE FALSE FALSE          FALSE FALSE
## [2,]           FALSE   FALSE  FALSE FALSE FALSE          FALSE FALSE
##      citrus fruit tropical fruit pip fruit grapes berries nuts/prunes
## [1,]         TRUE          FALSE     FALSE  FALSE   FALSE       FALSE
## [2,]        FALSE           TRUE     FALSE  FALSE   FALSE       FALSE
##      root vegetables onions herbs other vegetables
## [1,]           FALSE  FALSE FALSE            FALSE
## [2,]           FALSE  FALSE FALSE            FALSE
##      packaged fruit/vegetables whole milk butter  curd dessert butter milk
## [1,]                     FALSE      FALSE  FALSE FALSE   FALSE       FALSE
## [2,]                     FALSE      FALSE  FALSE FALSE   FALSE       FALSE
##      yogurt whipped/sour cream beverages UHT-milk condensed milk cream
## [1,]  FALSE              FALSE     FALSE    FALSE          FALSE FALSE
## [2,]   TRUE              FALSE     FALSE    FALSE          FALSE FALSE
##      soft cheese sliced cheese hard cheese cream cheese  processed cheese
## [1,]       FALSE         FALSE       FALSE         FALSE            FALSE
## [2,]       FALSE         FALSE       FALSE         FALSE            FALSE
##      spread cheese curd cheese specialty cheese mayonnaise salad dressing
## [1,]         FALSE       FALSE            FALSE      FALSE          FALSE
## [2,]         FALSE       FALSE            FALSE      FALSE          FALSE
##      tidbits frozen vegetables frozen fruits frozen meals frozen fish
## [1,]   FALSE             FALSE         FALSE        FALSE       FALSE
## [2,]   FALSE             FALSE         FALSE        FALSE       FALSE
##      frozen chicken ice cream frozen dessert frozen potato products
## [1,]          FALSE     FALSE          FALSE                  FALSE
## [2,]          FALSE     FALSE          FALSE                  FALSE
##      domestic eggs rolls/buns white bread brown bread pastry
## [1,]         FALSE      FALSE       FALSE       FALSE  FALSE
## [2,]         FALSE      FALSE       FALSE       FALSE  FALSE
##      roll products  semi-finished bread zwieback potato products flour
## [1,]          FALSE                TRUE    FALSE           FALSE FALSE
## [2,]          FALSE               FALSE    FALSE           FALSE FALSE
##       salt  rice pasta vinegar   oil margarine specialty fat sugar
## [1,] FALSE FALSE FALSE   FALSE FALSE      TRUE         FALSE FALSE
## [2,] FALSE FALSE FALSE   FALSE FALSE     FALSE         FALSE FALSE
##      artif. sweetener honey mustard ketchup spices soups ready soups
## [1,]            FALSE FALSE   FALSE   FALSE  FALSE FALSE        TRUE
## [2,]            FALSE FALSE   FALSE   FALSE  FALSE FALSE       FALSE
##      Instant food products sauces cereals organic products baking powder
## [1,]                 FALSE  FALSE   FALSE            FALSE         FALSE
## [2,]                 FALSE  FALSE   FALSE            FALSE         FALSE
##      preservation products pudding powder canned vegetables canned fruit
## [1,]                 FALSE          FALSE             FALSE        FALSE
## [2,]                 FALSE          FALSE             FALSE        FALSE
##      pickled vegetables specialty vegetables   jam sweet spreads
## [1,]              FALSE                FALSE FALSE         FALSE
## [2,]              FALSE                FALSE FALSE         FALSE
##      meat spreads canned fish dog food cat food pet care baby food coffee
## [1,]        FALSE       FALSE    FALSE    FALSE    FALSE     FALSE  FALSE
## [2,]        FALSE       FALSE    FALSE    FALSE    FALSE     FALSE   TRUE
##      instant coffee   tea cocoa drinks bottled water  soda misc. beverages
## [1,]          FALSE FALSE        FALSE         FALSE FALSE           FALSE
## [2,]          FALSE FALSE        FALSE         FALSE FALSE           FALSE
##      fruit/vegetable juice syrup bottled beer canned beer brandy whisky
## [1,]                 FALSE FALSE        FALSE       FALSE  FALSE  FALSE
## [2,]                 FALSE FALSE        FALSE       FALSE  FALSE  FALSE
##      liquor   rum liqueur liquor (appetizer) white wine red/blush wine
## [1,]  FALSE FALSE   FALSE              FALSE      FALSE          FALSE
## [2,]  FALSE FALSE   FALSE              FALSE      FALSE          FALSE
##      prosecco sparkling wine salty snack popcorn nut snack snack products
## [1,]    FALSE          FALSE       FALSE   FALSE     FALSE          FALSE
## [2,]    FALSE          FALSE       FALSE   FALSE     FALSE          FALSE
##      long life bakery product waffles cake bar chewing gum chocolate
## [1,]                    FALSE   FALSE    FALSE       FALSE     FALSE
## [2,]                    FALSE   FALSE    FALSE       FALSE     FALSE
##      cooking chocolate specialty chocolate specialty bar
## [1,]             FALSE               FALSE         FALSE
## [2,]             FALSE               FALSE         FALSE
##      chocolate marshmallow candy seasonal products detergent softener
## [1,]                 FALSE FALSE             FALSE     FALSE    FALSE
## [2,]                 FALSE FALSE             FALSE     FALSE    FALSE
##      decalcifier dish cleaner abrasive cleaner cleaner toilet cleaner
## [1,]       FALSE        FALSE            FALSE   FALSE          FALSE
## [2,]       FALSE        FALSE            FALSE   FALSE          FALSE
##      bathroom cleaner hair spray dental care male cosmetics
## [1,]            FALSE      FALSE       FALSE          FALSE
## [2,]            FALSE      FALSE       FALSE          FALSE
##      make up remover skin care female sanitary products baby cosmetics
## [1,]           FALSE     FALSE                    FALSE          FALSE
## [2,]           FALSE     FALSE                    FALSE          FALSE
##       soap rubbing alcohol hygiene articles napkins dishes cookware
## [1,] FALSE           FALSE            FALSE   FALSE  FALSE    FALSE
## [2,] FALSE           FALSE            FALSE   FALSE  FALSE    FALSE
##      kitchen utensil cling film/bags kitchen towels house keeping products
## [1,]           FALSE           FALSE          FALSE                  FALSE
## [2,]           FALSE           FALSE          FALSE                  FALSE
##      candles light bulbs sound storage medium newspapers photo/film
## [1,]   FALSE       FALSE                FALSE      FALSE      FALSE
## [2,]   FALSE       FALSE                FALSE      FALSE      FALSE
##      pot plants flower soil/fertilizer flower (seeds) shopping bags  bags
## [1,]      FALSE                  FALSE          FALSE         FALSE FALSE
## [2,]      FALSE                  FALSE          FALSE         FALSE FALSE
itemFrequencyPlot(Groceries[, itemFrequency(Groceries) > 0.06], cex.names = 1)

From the graph it can be observed that the top four items (whole milk, other vegetables, rolls/buns and soda) each appear in over 15% of transactions.
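
The same reading can be confirmed numerically from the item counts in the summary output. This is a minimal sketch using hand-copied counts; item_counts and item_share are illustrative names.

```r
# Absolute counts of the five most frequent items, from summary(Groceries).
item_counts <- c("whole milk" = 2513, "other vegetables" = 1903,
                 "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)
n_transactions <- 9835

# Relative frequency = share of transactions that contain the item.
item_share <- item_counts / n_transactions
round(item_share, 3)

# Items above the 15% mark.
names(item_share)[item_share > 0.15]
```

Yogurt, the fifth item, falls just under 14%, which is why only four bars clear the 15% line.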

Association rules are then mined from the full data set with a minimum support of 0.001 and a minimum confidence of 0.8, and a summary is run to provide basic descriptive statistics.

rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules
## set of 410 rules
summary(rules)
## set of 410 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5   6 
##  29 229 140  12 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   4.000   4.329   5.000   6.000 
## 
## summary of quality measures:
##     support           confidence          lift            count      
##  Min.   :0.001017   Min.   :0.8000   Min.   : 3.131   Min.   :10.00  
##  1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.: 3.312   1st Qu.:10.00  
##  Median :0.001220   Median :0.8462   Median : 3.588   Median :12.00  
##  Mean   :0.001247   Mean   :0.8663   Mean   : 3.951   Mean   :12.27  
##  3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.: 4.341   3rd Qu.:13.00  
##  Max.   :0.003152   Max.   :1.0000   Max.   :11.235   Max.   :31.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.001        0.8

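The rules are sorted by confidence in descending order to see which rules hold most reliably. The ten highest are printed and the first five are discussed below. All ten tie at a confidence of 1, so their relative order carries no meaning.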

inspect(sort(rules, by = "confidence")[1:10])
##      lhs                     rhs                    support confidence     lift count
## [1]  {rice,                                                                          
##       sugar}              => {whole milk}       0.001220132          1 3.913649    12
## [2]  {canned fish,                                                                   
##       hygiene articles}   => {whole milk}       0.001118454          1 3.913649    11
## [3]  {root vegetables,                                                               
##       butter,                                                                        
##       rice}               => {whole milk}       0.001016777          1 3.913649    10
## [4]  {root vegetables,                                                               
##       whipped/sour cream,                                                            
##       flour}              => {whole milk}       0.001728521          1 3.913649    17
## [5]  {butter,                                                                        
##       soft cheese,                                                                   
##       domestic eggs}      => {whole milk}       0.001016777          1 3.913649    10
## [6]  {citrus fruit,                                                                  
##       root vegetables,                                                               
##       soft cheese}        => {other vegetables} 0.001016777          1 5.168156    10
## [7]  {pip fruit,                                                                     
##       butter,                                                                        
##       hygiene articles}   => {whole milk}       0.001016777          1 3.913649    10
## [8]  {root vegetables,                                                               
##       whipped/sour cream,                                                            
##       hygiene articles}   => {whole milk}       0.001016777          1 3.913649    10
## [9]  {pip fruit,                                                                     
##       root vegetables,                                                               
##       hygiene articles}   => {whole milk}       0.001016777          1 3.913649    10
## [10] {cream cheese ,                                                                 
##       domestic eggs,                                                                 
##       sugar}              => {whole milk}       0.001118454          1 3.913649    11

All five rules have the same item on the right-hand side, whole milk, which is predictable given that it is the most frequently purchased item. They all have a confidence of 1, which means that every transaction in this data set containing the items on the left-hand side also contained whole milk. However, each rule rests on only 10 to 17 transactions, so support is around the average for the rule set, and the lift of 3.91 is close to the mean of 3.95. The top rule states that a customer who buys rice and sugar also buys whole milk, which held in all 12 transactions where the two appeared together.
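
As a worked check, the three quality measures for {rice, sugar} => {whole milk} can be reproduced from counts already shown in the output. The left-hand-side count of 12 is not printed directly; it is inferred here from the rule count of 12 and the confidence of 1. Variable names are illustrative.

```r
n_transactions   <- 9835   # total transactions
count_rule       <- 12     # baskets with rice, sugar and whole milk
count_lhs        <- 12     # baskets with rice and sugar (inferred from confidence = 1)
count_whole_milk <- 2513   # baskets with whole milk

support    <- count_rule / n_transactions                        # P(lhs and rhs)
confidence <- count_rule / count_lhs                             # P(rhs | lhs)
lift       <- confidence / (count_whole_milk / n_transactions)   # P(rhs | lhs) / P(rhs)

round(c(support = support, confidence = confidence, lift = lift), 6)
```

The results agree with the printed row (support 0.001220132, confidence 1, lift 3.913649).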

When the rules are instead sorted by support (the share of all transactions that contain every item in the rule), a new item appears on the right-hand side of the top five: other vegetables.

inspect(sort(rules, by = "support")[1:10])
##      lhs                        rhs                    support confidence     lift count
## [1]  {citrus fruit,                                                                     
##       tropical fruit,                                                                   
##       root vegetables,                                                                  
##       whole milk}            => {other vegetables} 0.003152008  0.8857143 4.577509    31
## [2]  {other vegetables,                                                                 
##       curd,                                                                             
##       domestic eggs}         => {whole milk}       0.002846975  0.8235294 3.223005    28
## [3]  {hamburger meat,                                                                   
##       curd}                  => {whole milk}       0.002541942  0.8064516 3.156169    25
## [4]  {herbs,                                                                            
##       rolls/buns}            => {whole milk}       0.002440264  0.8000000 3.130919    24
## [5]  {tropical fruit,                                                                   
##       herbs}                 => {whole milk}       0.002338587  0.8214286 3.214783    23
## [6]  {citrus fruit,                                                                     
##       root vegetables,                                                                  
##       other vegetables,                                                                 
##       yogurt}                => {whole milk}       0.002338587  0.8214286 3.214783    23
## [7]  {pork,                                                                             
##       other vegetables,                                                                 
##       butter}                => {whole milk}       0.002236909  0.8461538 3.311549    22
## [8]  {tropical fruit,                                                                   
##       root vegetables,                                                                  
##       yogurt,                                                                           
##       rolls/buns}            => {whole milk}       0.002236909  0.8148148 3.188899    22
## [9]  {tropical fruit,                                                                   
##       grapes,                                                                           
##       whole milk}            => {other vegetables} 0.002033554  0.8000000 4.134524    20
## [10] {root vegetables,                                                                  
##       other vegetables,                                                                 
##       yogurt,                                                                           
##       fruit/vegetable juice} => {whole milk}       0.002033554  0.8333333 3.261374    20

Other vegetables is the second most frequently purchased item, so it is no surprise that it shows up in the top five rules sorted by support.
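
Sorting on a single measure leaves tied rules in arbitrary order, which matters for the confidence ranking, where many rules share a value of 1. One possible way to combine the two rankings is sketched below. It assumes the installed arules version accepts a character vector for the by argument of sort(), with later entries used as tie-breakers. The names rules_ranked and q are illustrative.

```r
library(arules)
data("Groceries")

# Re-mine the same rule set quietly (same thresholds as in the text).
rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.8),
                 control = list(verbose = FALSE))

# Sort by confidence first and break ties by support
# (assumption: a vector `by` is supported by the installed arules).
rules_ranked <- sort(rules, by = c("confidence", "support"))

# Quality measures of the five highest-ranked rules.
q <- quality(rules_ranked)[1:5, c("support", "confidence", "lift")]
q
inspect(rules_ranked[1:5])
```

Ranked this way, the rules at the top are those that hold without exception and rest on the largest number of transactions.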

To take a closer look at rules involving the sale of whole milk, the following code is run. The lift > 1.2 condition does not exclude anything here, since the minimum lift in the rule set is 3.131.

#What are customers likely to buy that would predict that they would buy milk?
rulesWholeMilk<-subset(rules, subset = rhs %in% "whole milk" & lift> 1.2)
inspect(sort(rulesWholeMilk, by = "confidence", decreasing = TRUE)[1:5])
##     lhs                     rhs              support confidence     lift count
## [1] {rice,                                                                    
##      sugar}              => {whole milk} 0.001220132          1 3.913649    12
## [2] {canned fish,                                                             
##      hygiene articles}   => {whole milk} 0.001118454          1 3.913649    11
## [3] {root vegetables,                                                         
##      butter,                                                                  
##      rice}               => {whole milk} 0.001016777          1 3.913649    10
## [4] {root vegetables,                                                         
##      whipped/sour cream,                                                      
##      flour}              => {whole milk} 0.001728521          1 3.913649    17
## [5] {butter,                                                                  
##      soft cheese,                                                             
##      domestic eggs}      => {whole milk} 0.001016777          1 3.913649    10
#If a customer is buying milk, which other items are they then likely to purchase?
rulesWholeMilk<-subset(rules, subset = lhs %in% "whole milk" & lift> 1.2)
inspect(sort(rulesWholeMilk, by = "confidence", decreasing = TRUE)[1:5])
##     lhs                  rhs                    support confidence     lift count
## [1] {tropical fruit,                                                             
##      grapes,                                                                     
##      whole milk,                                                                 
##      yogurt}          => {other vegetables} 0.001016777  1.0000000 5.168156    10
## [2] {ham,                                                                        
##      tropical fruit,                                                             
##      pip fruit,                                                                  
##      whole milk}      => {other vegetables} 0.001118454  1.0000000 5.168156    11
## [3] {whole milk,                                                                 
##      rolls/buns,                                                                 
##      soda,                                                                       
##      newspapers}      => {other vegetables} 0.001016777  1.0000000 5.168156    10
## [4] {root vegetables,                                                            
##      whole milk,                                                                 
##      yogurt,                                                                     
##      oil}             => {other vegetables} 0.001423488  0.9333333 4.823612    14
## [5] {citrus fruit,                                                               
##      tropical fruit,                                                             
##      root vegetables,                                                            
##      whole milk,                                                                 
##      yogurt}          => {other vegetables} 0.001423488  0.9333333 4.823612    14

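When whole milk is on the left-hand side, the top five rules all have other vegetables on the right-hand side. As previously stated this is unsurprising, as other vegetables is a frequently purchased item. The rules in this second list have above-average lift (4.82 to 5.17 against a mean of 3.95), but their support ranges from the rule-set minimum of 0.0010 to only slightly above the mean of 0.0012, so each rests on just 10 to 14 transactions. They can be interpreted as relatively good rules, with the caveat that the evidence behind them is thin.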
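
The higher lift of the other vegetables rules follows directly from item frequency. When confidence equals 1, lift reduces to 1 / support(rhs), so it depends only on how common the right-hand-side item is. A minimal sketch with counts copied from the summary output (names are illustrative):

```r
n_transactions <- 9835
rhs_counts <- c("whole milk" = 2513, "other vegetables" = 1903)

# With confidence = 1: lift = 1 / support(rhs) = n / count(rhs).
lift_at_full_confidence <- n_transactions / rhs_counts
round(lift_at_full_confidence, 6)
```

These two values, 3.913649 and 5.168156, are exactly the lifts printed for every confidence-1 rule above, which is why they repeat from row to row.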

If we pinpoint rules with other vegetables on the left-hand side, it would be reasonable to expect other frequently purchased items on the right-hand side.

rulesOthrVeg<-subset(rules, subset = lhs %in% "other vegetables" & lift> 1.2)
inspect(sort(rulesOthrVeg, by = "confidence", decreasing = TRUE)[1:5])
##     lhs                     rhs              support confidence     lift count
## [1] {root vegetables,                                                         
##      other vegetables,                                                        
##      yogurt,                                                                  
##      oil}                => {whole milk} 0.001423488          1 3.913649    14
## [2] {root vegetables,                                                         
##      other vegetables,                                                        
##      butter,                                                                  
##      white bread}        => {whole milk} 0.001016777          1 3.913649    10
## [3] {pork,                                                                    
##      other vegetables,                                                        
##      butter,                                                                  
##      whipped/sour cream} => {whole milk} 0.001016777          1 3.913649    10
## [4] {other vegetables,                                                        
##      butter,                                                                  
##      whipped/sour cream,                                                      
##      domestic eggs}      => {whole milk} 0.001220132          1 3.913649    12
## [5] {pip fruit,                                                               
##      root vegetables,                                                         
##      other vegetables,                                                        
##      bottled water}      => {whole milk} 0.001118454          1 3.913649    11

Sure enough, the right-hand side is populated with whole milk. The pattern indicates that the best evaluation of customer behavior this analysis supports is that customers who buy one frequently purchased item are likely to buy others as well. This is common sense and probably did not require a market basket analysis to discover: people tend to buy products that are part of a weekly grocery list and are generally considered dietary staples.
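
One way to see how far this pattern extends beyond the top five lists is to tabulate the right-hand side of every mined rule. The sketch below re-mines the rules with the same thresholds. rhs_table is an illustrative name, and the counts it prints are not quoted here because they were not part of the original output.

```r
library(arules)
data("Groceries")

# Same thresholds as in the text; verbose output suppressed.
rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.8),
                 control = list(verbose = FALSE))

# How many rules predict each right-hand-side item?
rhs_table <- sort(table(labels(rhs(rules))), decreasing = TRUE)
rhs_table

# Smallest lift in the rule set; if it exceeds 1.2, the lift > 1.2
# condition in the subset() calls does not remove any rules.
min(quality(rules)$lift)
```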

If we take the third most frequently purchased item, rolls/buns, we see that it follows this pattern.

rulesRollsBuns<-subset(rules, subset = lhs %in% "rolls/buns" & lift> 1.2)
inspect(sort(rulesRollsBuns, by = "confidence", decreasing = TRUE)[1:5])
##     lhs                     rhs                    support confidence     lift count
## [1] {whole milk,                                                                    
##      rolls/buns,                                                                    
##      soda,                                                                          
##      newspapers}         => {other vegetables} 0.001016777  1.0000000 5.168156    10
## [2] {citrus fruit,                                                                  
##      whipped/sour cream,                                                            
##      rolls/buns,                                                                    
##      pastry}             => {whole milk}       0.001016777  1.0000000 3.913649    10
## [3] {sausage,                                                                       
##      tropical fruit,                                                                
##      root vegetables,                                                               
##      rolls/buns}         => {whole milk}       0.001016777  1.0000000 3.913649    10
## [4] {beef,                                                                          
##      tropical fruit,                                                                
##      yogurt,                                                                        
##      rolls/buns}         => {whole milk}       0.001321810  0.9285714 3.634103    13
## [5] {tropical fruit,                                                                
##      root vegetables,                                                               
##      rolls/buns,                                                                    
##      bottled water}      => {whole milk}       0.001118454  0.9166667 3.587512    11