Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. In other words, each receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 3 before midnight.

Loading Libraries and Data

library(arules)
library(arulesViz)
Grocery <- read.transactions("GroceryDataSet.csv", sep = ",")

Data summary and inspection

Grocery
## transactions in sparse format with
##  9835 transactions (rows) and
##  169 items (columns)
summary(Grocery)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
inspect(Grocery[1:6])
##     items                     
## [1] {citrus fruit,            
##      margarine,               
##      ready soups,             
##      semi-finished bread}     
## [2] {coffee,                  
##      tropical fruit,          
##      yogurt}                  
## [3] {whole milk}              
## [4] {cream cheese,            
##      meat spreads,            
##      pip fruit,               
##      yogurt}                  
## [5] {condensed milk,          
##      long life bakery product,
##      other vegetables,        
##      whole milk}              
## [6] {abrasive cleaner,        
##      butter,                  
##      rice,                    
##      whole milk,              
##      yogurt}
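
The figures in the summary are tied together by simple arithmetic, which makes for a quick sanity check on the load. The sketch below uses only numbers printed above; the variable names are my own. It assumes the six counts under "most frequent items" (including "(Other)") add up to every item occurrence in the data.

```r
# Values copied from the summary(Grocery) output above
n_transactions <- 9835
n_items        <- 169
item_counts <- c("whole milk" = 2513, "other vegetables" = 1903,
                 "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372,
                 "(Other)" = 34055)

# Total number of item occurrences across all receipts
total_items <- sum(item_counts)

# Density: share of filled cells in the transactions-by-items matrix
density <- total_items / (n_transactions * n_items)

# Mean basket size: item occurrences per receipt
mean_basket_size <- total_items / n_transactions

# Relative frequency of the most common item
milk_share <- item_counts[["whole milk"]] / n_transactions

round(c(density = density, mean_size = mean_basket_size, milk = milk_share), 4)
```

The density and mean basket size computed this way should agree with the 0.02609146 and 4.409 reported by `summary(Grocery)`.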

Item Frequency and Plot

itemFrequency(Grocery[,1:10])
## abrasive cleaner artif. sweetener   baby cosmetics        baby food 
##     0.0035587189     0.0032536858     0.0006100661     0.0001016777 
##             bags    baking powder bathroom cleaner             beef 
##     0.0004067107     0.0176919166     0.0027452974     0.0524656838 
##          berries        beverages 
##     0.0332486019     0.0260294865
itemFrequencyPlot(Grocery, topN=10)

Generating Association Rules

I lowered the minimum support and confidence below the apriori defaults of 0.1 and 0.8, respectively, so that the model generates more than a handful of candidate rules. Setting minlen = 2 excludes rules with an empty left-hand side.
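
A relative support threshold is easier to judge as a number of receipts. The short sketch below (variable names are my own) converts the 0.007 threshold into a count for the 9835 transactions in this data set.

```r
n_transactions <- 9835   # from the data summary
min_support    <- 0.007  # threshold passed to apriori()

# Threshold expressed in receipts (fractional)
min_support * n_transactions

# Smallest whole number of receipts that reaches the threshold
min_count <- ceiling(min_support * n_transactions)
min_count

# One receipt fewer falls just short of the threshold
(min_count - 1) / n_transactions < min_support
```

The apriori log prints "Absolute minimum support count: 68", which looks like the same product truncated to a whole number. In practice an itemset needs to appear on at least 69 receipts, which matches the minimum count of 69 in the rule summary.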

basket.model <- apriori(Grocery, parameter = list(support=0.007, confidence=0.25, minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.007      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 68 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [104 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [363 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(basket.model)
## set of 363 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 137 214  12 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.656   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence          lift            count      
##  Min.   :0.007016   Min.   :0.2500   Min.   :0.9932   Min.   : 69.0  
##  1st Qu.:0.008134   1st Qu.:0.2962   1st Qu.:1.6060   1st Qu.: 80.0  
##  Median :0.009659   Median :0.3551   Median :1.9086   Median : 95.0  
##  Mean   :0.012945   Mean   :0.3743   Mean   :2.0072   Mean   :127.3  
##  3rd Qu.:0.013777   3rd Qu.:0.4420   3rd Qu.:2.3289   3rd Qu.:135.5  
##  Max.   :0.074835   Max.   :0.6389   Max.   :3.9565   Max.   :736.0  
## 
## mining info:
##     data ntransactions support confidence
##  Grocery          9835   0.007       0.25
inspect(basket.model[1:6])
##     lhs                      rhs                support     confidence lift    
## [1] {herbs}               => {root vegetables}  0.007015760 0.4312500  3.956477
## [2] {herbs}               => {other vegetables} 0.007727504 0.4750000  2.454874
## [3] {herbs}               => {whole milk}       0.007727504 0.4750000  1.858983
## [4] {processed cheese}    => {whole milk}       0.007015760 0.4233129  1.656698
## [5] {semi-finished bread} => {whole milk}       0.007117438 0.4022989  1.574457
## [6] {detergent}           => {whole milk}       0.008947636 0.4656085  1.822228
##     count
## [1] 69   
## [2] 76   
## [3] 76   
## [4] 69   
## [5] 70   
## [6] 88

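The top 10 rules are ranked by lift. Lift compares how often the LHS and RHS items actually occur together with how often they would occur together if they were independent. A lift above 1 means the items appear together more often than chance alone would predict, and the higher the lift, the stronger the association.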
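
To make the three measures concrete, here is a minimal base R sketch on a handful of made-up baskets. The baskets and the `support()` helper are hypothetical and only for illustration; they are not taken from the Groceries data or from the arules package.

```r
# Hypothetical toy baskets (not from the Groceries data)
baskets <- list(
  c("herbs", "root vegetables", "whole milk"),
  c("herbs", "root vegetables"),
  c("whole milk"),
  c("root vegetables", "whole milk"),
  c("whole milk", "yogurt")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "herbs"
rhs <- "root vegetables"

supp_rule <- support(c(lhs, rhs), baskets)      # P(lhs and rhs)
conf_rule <- supp_rule / support(lhs, baskets)  # P(rhs | lhs)
lift_rule <- conf_rule / support(rhs, baskets)  # P(rhs | lhs) / P(rhs)

round(c(support = supp_rule, confidence = conf_rule, lift = lift_rule), 3)
```

In this toy data every herbs basket also contains root vegetables, while root vegetables appear in only three of the five baskets overall, so the lift comes out above 1.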

Top 10 Association Rules by Lift

inspect(sort(basket.model, by="lift")[1:10])
##      lhs                   rhs                      support confidence     lift count
## [1]  {herbs}            => {root vegetables}    0.007015760  0.4312500 3.956477    69
## [2]  {berries}          => {whipped/sour cream} 0.009049314  0.2721713 3.796886    89
## [3]  {other vegetables,                                                              
##       tropical fruit,                                                                
##       whole milk}       => {root vegetables}    0.007015760  0.4107143 3.768074    69
## [4]  {beef,                                                                          
##       other vegetables} => {root vegetables}    0.007930859  0.4020619 3.688692    78
## [5]  {other vegetables,                                                              
##       tropical fruit}   => {pip fruit}          0.009456024  0.2634561 3.482649    93
## [6]  {beef,                                                                          
##       whole milk}       => {root vegetables}    0.008032537  0.3779904 3.467851    79
## [7]  {other vegetables,                                                              
##       pip fruit}        => {tropical fruit}     0.009456024  0.3618677 3.448613    93
## [8]  {citrus fruit,                                                                  
##       other vegetables} => {root vegetables}    0.010371124  0.3591549 3.295045   102
## [9]  {other vegetables,                                                              
##       whole milk,                                                                    
##       yogurt}           => {tropical fruit}     0.007625826  0.3424658 3.263712    75
## [10] {other vegetables,                                                              
##       whole milk,                                                                    
##       yogurt}           => {root vegetables}    0.007829181  0.3515982 3.225716    77
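
The columns in this table are linked by the definitions of support, confidence and lift, so the top rule can be unpacked by hand. The sketch below uses only the printed values for {herbs} => {root vegetables}. The herbs count and the root vegetables support are back-calculated from those values rather than read from the data, so treat them as approximate.

```r
n_transactions <- 9835       # from the data summary
rule_count     <- 69         # count for {herbs} => {root vegetables}
confidence     <- 0.4312500  # as printed
lift           <- 3.956477   # as printed

# support = count / number of transactions
support_rule <- rule_count / n_transactions

# confidence = count(lhs and rhs) / count(lhs), so count(lhs) follows
lhs_count <- rule_count / confidence

# lift = confidence / support(rhs), so support(rhs) follows
rhs_support <- confidence / lift

round(c(support = support_rule, herbs_count = lhs_count,
        root_veg_support = rhs_support), 4)
```

Read this way, the lift of about 3.96 says that root vegetables turn up in roughly 43% of baskets containing herbs, compared with roughly 11% of all baskets.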

Graph of Top 10 Rules by Lift

top10Rules <- head(basket.model, n = 10, by = "lift")
plot(top10Rules, method = "graph")

CONCLUSION

Most of the top rules combine fruits and vegetables. This is not surprising, as these items are conveniently located in the same section of most grocery stores. It is also not surprising that whole milk comes up often in the list, since it is the single most frequently purchased item in the pool of transactions (2,513 of 9,835). What may be a surprise is the rule linking berries and whipped/sour cream: they are not physically near each other in the store, but it makes sense to market them together (e.g., when one is on sale, the other may be put on sale, too).