Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. A receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Do a simple cluster analysis on the data as well. Use whichever packages you like.

library(arules)
library(RColorBrewer)
library(kableExtra)

Load Data

Let’s load the data and examine it.

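# stringsAsFactors = TRUE is needed under R >= 4.0 so that summary() reports per-column item counts
data <- read.csv("https://raw.githubusercontent.com/monuchacko/cuny_msds/master/data_624/data/GroceryDataSet.csv", header=FALSE, stringsAsFactors = TRUE)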

# View Summary
summary(data) %>%
  kable() %>%
    kable_styling()

| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | V29 | V30 | V31 | V32 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sausage: 825 | (blank): 2159 | (blank): 3802 | (blank): 5101 | (blank): 6106 | (blank): 6961 | (blank): 7606 | (blank): 8151 | (blank): 8589 | (blank): 8939 | (blank): 9185 | (blank): 9367 | (blank): 9484 | (blank): 9562 | (blank): 9639 | (blank): 9694 | (blank): 9740 | (blank): 9769 | (blank): 9783 | (blank): 9797 | (blank): 9806 | (blank): 9817 | (blank): 9821 | (blank): 9827 | (blank): 9828 | (blank): 9828 | (blank): 9829 | (blank): 9830 | (blank): 9831 | (blank): 9834 | (blank): 9834 | (blank): 9834 |
| whole milk: 717 | whole milk: 654 | whole milk: 506 | whole milk: 315 | rolls/buns: 176 | soda: 150 | soda: 120 | shopping bags: 76 | soda: 61 | shopping bags: 49 | shopping bags: 40 | soda: 30 | soda: 24 | shopping bags: 18 | shopping bags: 16 | shopping bags: 11 | napkins: 8 | candy: 5 | detergent: 4 | bottled beer: 3 | napkins: 4 | napkins: 2 | waffles: 2 | bottled beer: 2 | chocolate: 2 | chocolate: 1 | abrasive cleaner: 1 | chocolate: 1 | cooking chocolate: 1 | skin care: 1 | hygiene articles: 1 | candles: 1 |
| frankfurter: 580 | other vegetables: 550 | other vegetables: 415 | other vegetables: 254 | soda: 168 | rolls/buns: 146 | shopping bags: 107 | bottled water: 68 | shopping bags: 56 | soda: 39 | newspapers: 36 | shopping bags: 19 | shopping bags: 18 | fruit/vegetable juice: 17 | napkins: 13 | napkins: 9 | chocolate: 5 | chocolate: 5 | fruit/vegetable juice: 4 | napkins: 3 | fruit/vegetable juice: 2 | baking powder: 1 | chocolate marshmallow: 1 | bottled water: 1 | fruit/vegetable juice: 1 | female sanitary products: 1 | chocolate: 1 | hygiene articles: 1 | house keeping products: 2 | NA | NA | NA |
| tropical fruit: 482 | root vegetables: 383 | rolls/buns: 293 | rolls/buns: 238 | yogurt: 160 | shopping bags: 107 | rolls/buns: 92 | newspapers: 66 | fruit/vegetable juice: 55 | fruit/vegetable juice: 34 | pastry: 27 | chocolate: 17 | fruit/vegetable juice: 16 | newspapers: 14 | fruit/vegetable juice: 11 | chocolate: 8 | newspapers: 5 | napkins: 5 | shopping bags: 4 | pot plants: 3 | house keeping products: 2 | bottled beer: 1 | cling film/bags: 1 | cake bar: 1 | liquor (appetizer): 1 | long life bakery product: 1 | hygiene articles: 2 | napkins: 2 | soups: 1 | NA | NA | NA |
| other vegetables: 460 | rolls/buns: 378 | yogurt: 289 | soda: 211 | whole milk: 149 | bottled water: 95 | newspapers: 68 | rolls/buns: 59 | bottled water: 54 | newspapers: 33 | bottled water: 25 | fruit/vegetable juice: 17 | napkins: 14 | soda: 14 | hygiene articles: 11 | hygiene articles: 6 | candy: 4 | newspapers: 3 | chocolate: 3 | candy: 2 | hygiene articles: 2 | cleaner: 1 | dental care: 1 | coffee: 1 | long life bakery product: 1 | margarine: 1 | long life bakery product: 1 | sugar: 1 | NA | NA | NA | NA |
| citrus fruit: 453 | tropical fruit: 355 | soda: 229 | yogurt: 202 | shopping bags: 145 | yogurt: 93 | domestic eggs: 57 | soda: 59 | newspapers: 51 | bottled water: 26 | napkins: 23 | napkins: 17 | newspapers: 14 | napkins: 11 | candy: 9 | long life bakery product: 6 | fruit/vegetable juice: 4 | bottled water: 2 | bottled water: 2 | hygiene articles: 2 | candles: 1 | cling film/bags: 1 | dog food: 1 | flour: 1 | pasta: 1 | rum: 1 | specialty fat: 1 | NA | NA | NA | NA | NA |
| (Other): 6318 | (Other): 5356 | (Other): 4301 | (Other): 3514 | (Other): 2931 | (Other): 2283 | (Other): 1785 | (Other): 1356 | (Other): 969 | (Other): 715 | (Other): 499 | (Other): 368 | (Other): 265 | (Other): 199 | (Other): 136 | (Other): 101 | (Other): 69 | (Other): 46 | (Other): 35 | (Other): 25 | (Other): 18 | (Other): 12 | (Other): 8 | (Other): 2 | white wine: 1 | (Other): 2 | NA | NA | NA | NA | NA | NA |

groceryDataset = read.transactions("https://raw.githubusercontent.com/monuchacko/cuny_msds/master/data_624/data/GroceryDataSet.csv", sep = ',', rm.duplicates = TRUE)

Analyse data

# View summary
summary(groceryDataset)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

As the summary shows, ‘whole milk’ is the most frequent item, appearing in 2513 transactions, followed by ‘other vegetables’ with 1903. Let’s visualize this with an item frequency plot.

itemFrequencyPlot(groceryDataset,topN=10, type="absolute", col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")

The bar plot makes the ranking easier to read. itemFrequencyPlot() draws a bar plot of item frequencies from the transactions object (an itemMatrix); here it shows the absolute counts of the 10 most frequent items.
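The counts printed by summary() can also be read as relative frequencies, which is the same as the support of each single item. The sketch below uses only numbers copied from the summary output above (the variable names are illustrative) and also checks that they are consistent with the reported density and mean basket size.

```r
# Figures copied from the summary(groceryDataset) output
n_transactions <- 9835
n_items <- 169
top_counts <- c("whole milk" = 2513, "other vegetables" = 1903,
                "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)
other_count <- 34055

# Relative frequency = share of baskets that contain the item
relative_freq <- top_counts / n_transactions
print(round(relative_freq, 3))

# Total number of item entries across all baskets
total_entries <- sum(top_counts) + other_count

# Density = filled cells / all cells of the transaction matrix
density <- total_entries / (n_transactions * n_items)

# Mean basket size = item entries per transaction
mean_basket_size <- total_entries / n_transactions
print(c(density = density, mean_basket_size = mean_basket_size))
```

Passing type="relative" instead of type="absolute" to itemFrequencyPlot() would plot these shares directly.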

Train/ Extract

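Let’s run the Apriori algorithm to extract rules, which requires a minimum support and a minimum confidence. Support is the fraction of transactions that contain an itemset, i.e. how likely the items are to be purchased together. As a guide for the threshold, we express 6 * 7 = 42 transactions as a fraction of the total; the apriori() calls below use the rounded value 0.004.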

min_support <- 6 * 7 / nrow(groceryDataset)
min_support
## [1] 0.004270463
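To make the threshold more concrete, it helps to convert it back into a number of baskets. The sketch below is plain arithmetic on the 9835 transactions reported by the summary; the variable names are illustrative. It compares the computed threshold with the rounded value 0.004 that is passed to apriori() further down.

```r
# Number of transactions reported by summary(groceryDataset)
n_transactions <- 9835

# Threshold computed in the text: 6 * 7 = 42 baskets
computed_support <- 6 * 7 / n_transactions

# Rounded threshold used in the apriori() calls
used_support <- 0.004

# Express both thresholds as basket counts
computed_count <- computed_support * n_transactions
used_count <- used_support * n_transactions

print(c(computed = computed_count, used = used_count))
```

The rounded threshold corresponds to about 39.3 baskets, which is in line with the "Absolute minimum support count: 39" line in the apriori() log.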

Next comes confidence, the likelihood of a product being purchased given that another product is purchased:

Confidence(p1 -> p2) = (# of observations where p1 and p2 are purchased) / (# of observations where p1 is purchased)
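As a minimal sketch of how these measures are computed, the base R code below evaluates support, confidence and lift by hand on five made-up baskets. The baskets and the helper names toy_support, toy_confidence and toy_lift are hypothetical and are not part of arules or of the grocery data.

```r
# Five hypothetical baskets (illustrative data, not taken from the grocery file)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "eggs"),
  c("milk", "eggs"),
  c("bread"),
  c("milk", "bread", "butter")
)

# Support: fraction of baskets that contain every item in `items`
toy_support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
toy_confidence <- function(lhs, rhs, baskets) {
  toy_support(c(lhs, rhs), baskets) / toy_support(lhs, baskets)
}

# Lift(lhs -> rhs) = confidence(lhs -> rhs) / support(rhs)
toy_lift <- function(lhs, rhs, baskets) {
  toy_confidence(lhs, rhs, baskets) / toy_support(rhs, baskets)
}

s <- toy_support(c("milk", "bread"), baskets)     # 3 of 5 baskets
conf <- toy_confidence("milk", "bread", baskets)  # 3 of the 4 milk baskets
l <- toy_lift("milk", "bread", baskets)           # 0.75 / 0.8
print(c(support = s, confidence = conf, lift = l))
```

In this toy data the lift of milk -> bread is below 1 (0.9375), meaning bread is slightly less likely in baskets with milk than in baskets overall.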

# Training Apriori on the grocery dataset
rules = apriori(data = groceryDataset, parameter = list(support = 0.004, confidence = 0.6))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 39 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [40 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

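Inspect the top 10 rules by lift. Lift is the rule’s confidence divided by the support of the right-hand side; values above 1 mean the items occur together more often than would be expected if they were independent.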

inspect(sort(rules, by = 'lift')[1:10])
##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       tropical fruit}     => {other vegetables} 0.004473818  0.7857143 0.005693950 4.060694    44
## [2]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.005795628  0.6333333 0.009150991 3.273165    57
## [3]  {pip fruit,                                                                                 
##       root vegetables,                                                                           
##       whole milk}         => {other vegetables} 0.005490595  0.6136364 0.008947636 3.171368    54
## [4]  {root vegetables,                                                                           
##       tropical fruit,                                                                            
##       yogurt}             => {other vegetables} 0.004982206  0.6125000 0.008134215 3.165495    49
## [5]  {pip fruit,                                                                                 
##       whipped/sour cream} => {other vegetables} 0.005592272  0.6043956 0.009252669 3.123610    55
## [6]  {onions,                                                                                    
##       root vegetables}    => {other vegetables} 0.005693950  0.6021505 0.009456024 3.112008    56
## [7]  {curd,                                                                                      
##       domestic eggs}      => {whole milk}       0.004778851  0.7343750 0.006507372 2.874086    47
## [8]  {butter,                                                                                    
##       curd}               => {whole milk}       0.004880529  0.7164179 0.006812405 2.803808    48
## [9]  {tropical fruit,                                                                            
##       whipped/sour cream,                                                                        
##       yogurt}             => {whole milk}       0.004372140  0.7049180 0.006202339 2.758802    43
## [10] {root vegetables,                                                                           
##       tropical fruit,                                                                            
##       yogurt}             => {whole milk}       0.005693950  0.7000000 0.008134215 2.739554    56
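As a sanity check, the columns for rule [1] can be reproduced by hand. The sketch below uses only numbers printed above: the rule's count and coverage, the 9835 transactions, and the 1903 baskets containing ‘other vegetables’ from the summary. The variable names are illustrative.

```r
n_transactions <- 9835      # rows reported by summary(groceryDataset)
count_rule <- 44            # "count" of rule [1]: baskets with lhs and rhs
coverage_lhs <- 0.005693950 # "coverage" of rule [1]: support of the lhs
count_rhs <- 1903           # baskets containing 'other vegetables'

# Support of the whole rule
support_rule <- count_rule / n_transactions

# Recover the number of baskets containing the lhs from its coverage
count_lhs <- round(coverage_lhs * n_transactions)

# Confidence = baskets with lhs and rhs / baskets with lhs
confidence_rule <- count_rule / count_lhs

# Lift = confidence / support of the rhs
lift_rule <- confidence_rule / (count_rhs / n_transactions)

print(round(c(support = support_rule,
              confidence = confidence_rule,
              lift = lift_rule), 6))
```

The results agree with the support, confidence and lift columns reported by inspect() for this rule.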

We started with a minimum confidence of 0.6, which produced 40 rules. Let’s reduce it to 0.4 and see whether rules with higher lift appear.

# Training Apriori on the dataset
rules = apriori(data = groceryDataset, parameter = list(support = 0.004, confidence = 0.4))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 39 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [432 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Let’s look at the top rules after lowering the minimum confidence to 0.4.

# Inspect the top 10 rules by lift
inspect(sort(rules, by = 'lift')[1:10])
##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {liquor}             => {bottled beer}     0.004677173  0.4220183 0.011082867 5.240594    46
## [2]  {herbs,                                                                                     
##       whole milk}         => {root vegetables}  0.004168785  0.5394737 0.007727504 4.949369    41
## [3]  {citrus fruit,                                                                              
##       other vegetables,                                                                          
##       tropical fruit}     => {root vegetables}  0.004473818  0.4943820 0.009049314 4.535678    44
## [4]  {citrus fruit,                                                                              
##       other vegetables,                                                                          
##       root vegetables}    => {tropical fruit}   0.004473818  0.4313725 0.010371124 4.110997    44
## [5]  {citrus fruit,                                                                              
##       other vegetables,                                                                          
##       whole milk}         => {root vegetables}  0.005795628  0.4453125 0.013014743 4.085493    57
## [6]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       tropical fruit}     => {other vegetables} 0.004473818  0.7857143 0.005693950 4.060694    44
## [7]  {herbs}              => {root vegetables}  0.007015760  0.4312500 0.016268429 3.956477    69
## [8]  {tropical fruit,                                                                            
##       whipped/sour cream,                                                                        
##       whole milk}         => {yogurt}           0.004372140  0.5512821 0.007930859 3.951792    43
## [9]  {citrus fruit,                                                                              
##       pip fruit}          => {tropical fruit}   0.005592272  0.4044118 0.013828165 3.854060    55
## [10] {whipped/sour cream,                                                                        
##       whole milk,                                                                                
##       yogurt}             => {tropical fruit}   0.004372140  0.4018692 0.010879512 3.829829    43

The top lift rises from 4.06 to 5.24, but there is still room for improvement. For example, ‘citrus fruit’ appears in several of the top rules. We can lower the minimum confidence further to 0.2 and evaluate the result.

# Training Apriori on the dataset
rules = apriori(data = groceryDataset, parameter = list(support = 0.004, confidence = 0.2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 39 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [126 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [1268 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Inspect the top 10 rules by lift
inspect(sort(rules, by = 'lift')[1:10])
##      lhs                     rhs                      support confidence    coverage     lift count
## [1]  {flour}              => {sugar}              0.004982206  0.2865497 0.017386884 8.463112    49
## [2]  {processed cheese}   => {white bread}        0.004168785  0.2515337 0.016573462 5.975445    41
## [3]  {liquor}             => {bottled beer}       0.004677173  0.4220183 0.011082867 5.240594    46
## [4]  {berries,                                                                                     
##       whole milk}         => {whipped/sour cream} 0.004270463  0.3620690 0.011794611 5.050990    42
## [5]  {herbs,                                                                                       
##       whole milk}         => {root vegetables}    0.004168785  0.5394737 0.007727504 4.949369    41
## [6]  {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       tropical fruit}     => {root vegetables}    0.004473818  0.4943820 0.009049314 4.535678    44
## [7]  {other vegetables,                                                                            
##       root vegetables,                                                                             
##       tropical fruit}     => {citrus fruit}       0.004473818  0.3636364 0.012302999 4.393567    44
## [8]  {whipped/sour cream,                                                                          
##       yogurt}             => {curd}               0.004575496  0.2205882 0.020742247 4.140239    45
## [9]  {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       root vegetables}    => {tropical fruit}     0.004473818  0.4313725 0.010371124 4.110997    44
## [10] {citrus fruit,                                                                                
##       other vegetables,                                                                            
##       whole milk}         => {root vegetables}    0.005795628  0.4453125 0.013014743 4.085493    57

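Lowering the minimum confidence to 0.2 yields 1268 rules, and the top rules by lift are more varied: {flour} => {sugar} now leads with a lift of 8.46, followed by {processed cheese} => {white bread} at 5.98. Note that these high-lift rules have modest confidence (0.29 and 0.25), so the lower threshold trades confidence for stronger associations.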