Objective:

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. In other words, a receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased on it. Each line is called a transaction, and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Libraries Required

#install.packages("arules")
library("arules")
#Load and read data
Data <- read.transactions("GroceryDataSet.csv")
## Warning in asMethod(object): removing duplicated items in transactions
Data
## transactions in sparse format with
##  9835 transactions (rows) and
##  8219 items (columns)

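There are 9835 transactions (rows) and 8219 items (columns). The item count is this high because read.transactions was called without a sep argument, so each line was split on whitespace rather than on commas. Labels such as "vegetables,whole" are therefore fragments of adjacent product names, not single products.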
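
To illustrate how the separator affects the parsed items, here is a minimal sketch. It writes a hypothetical three-receipt file (not the assignment data) to a temporary path and reads it twice, once with the default separator and once with sep = ",". The file contents and variable names are assumptions made for illustration only.

```r
library(arules)

# Hypothetical three-receipt file, written to a temp path for illustration
tmp <- tempfile(fileext = ".csv")
writeLines(c("whole milk,other vegetables",
             "whole milk,rolls/buns",
             "other vegetables"), tmp)

# Default sep = "" splits on whitespace: product names break into fragments
by_space <- read.transactions(tmp, format = "basket")

# Explicit comma separator: one item per product
by_comma <- read.transactions(tmp, format = "basket", sep = ",")

itemLabels(by_space)  # fragments such as "milk,other"
itemLabels(by_comma)  # complete product names such as "whole milk"
```

If the real file were read with sep = ",", the item counts and rules reported below would change, so the outputs in this report should be read as belonging to the whitespace-split version of the data.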

summary(Data)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  8219 columns (items) and a density of 0.0004422899 
## 
## most frequent items:
## vegetables,whole            whole         tropical            other 
##              940              717              482              460 
##           citrus          (Other) 
##              453            32700 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1380 2733 1774 1257  910  601  415  293  166   95   75   44   39   19   11    9 
##   17   18   19   20   21   23 
##    2    3    3    3    1    2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.635   5.000  23.000 
## 
## includes extended item information - examples:
##                                labels
## 1    (appetizer),,,,,,,,,,,,,,,,,,,,,
## 2  (appetizer),,,,,,,,,,,,,,,,,,,,,,,
## 3 (appetizer),,,,,,,,,,,,,,,,,,,,,,,,

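The summary above shows that the most frequent item label is "vegetables,whole" with 940 occurrences, followed by "whole" with 717. Both labels are fragments of product names rather than complete products. Let’s visualize the distribution with an item frequency plot.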

itemFrequencyPlot(Data, topN=20, type="relative")

The item frequency bar plot above shows the relative frequency (the share of transactions containing each item) of the 20 most frequent items in the itemMatrix.
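
The numbers behind such a plot can also be inspected directly. The sketch below uses a hypothetical four-receipt data set (not the assignment data) to show the difference between relative and absolute item frequency as returned by itemFrequency.

```r
library(arules)

# Hypothetical four-receipt data set, for illustration only
receipts <- list(c("milk", "bread"),
                 c("milk", "beer"),
                 c("milk", "bread", "eggs"),
                 c("beer"))
trans <- as(receipts, "transactions")

# Relative frequency: share of transactions containing the item
rel <- itemFrequency(trans, type = "relative")

# Absolute frequency: number of transactions containing the item
cnt <- itemFrequency(trans, type = "absolute")

sort(rel, decreasing = TRUE)
```

Here "milk" appears in 3 of 4 receipts, so its relative frequency is 0.75 and its absolute frequency is 3.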

Basket Rules

Run the apriori algorithm to extract rules, specifying the minimum support and confidence thresholds.

# Candidate minimum support: 42 transactions (6 * 7) out of nrow(Data)
min_support <- 6 * 7 / nrow(Data)
min_support
## [1] 0.004270463
# Training Apriori on the grocery dataset

rules <- apriori(Data, parameter = list(supp = 0.004, conf = 0.3, maxlen=3))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.004      1
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 39 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[8219 item(s), 9835 transaction(s)] done [0.02s].
## sorting and recoding items ... [130 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3
## Warning in apriori(Data, parameter = list(supp = 0.004, conf = 0.3, maxlen
## = 3)): Mining stopped (maxlen reached). Only patterns up to a length of 3
## returned!
##  done [0.00s].
## writing ... [35 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
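
The log above reports an absolute minimum support count of 39, while the threshold computed before the call corresponds to 42 transactions. The short sketch below reproduces both numbers in base R. The variable names are illustrative, and the comparison assumes the rounded value 0.004 was passed to apriori on purpose.

```r
n_transactions <- 9835                    # from the summary of Data
supp_used      <- 0.004                   # value passed to apriori()
supp_computed  <- 6 * 7 / n_transactions  # value computed before the call

supp_used * n_transactions      # 39.34; the log reports a count of 39
supp_computed * n_transactions  # 42
```

Passing the computed value instead of 0.004 would make the support threshold slightly stricter.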

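Inspect the top 10 rules by lift. Lift measures how much more often the left-hand and right-hand sides of a rule occur together than would be expected if they were independent; a lift above 1 indicates a positive association.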

summary(rules)
## set of 35 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 28  7 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0     2.0     2.0     2.2     2.0     3.0 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.004270   Min.   :0.3000   Min.   :0.004270   Min.   :  3.139  
##  1st Qu.:0.005084   1st Qu.:0.4126   1st Qu.:0.005948   1st Qu.:  7.115  
##  Median :0.007321   Median :0.6842   Median :0.014743   Median : 12.866  
##  Mean   :0.011193   Mean   :0.7004   Mean   :0.019609   Mean   : 26.182  
##  3rd Qu.:0.014082   3rd Qu.:1.0000   3rd Qu.:0.035587   3rd Qu.: 26.726  
##  Max.   :0.037417   Max.   :1.0000   Max.   :0.046772   Max.   :196.700  
##      count      
##  Min.   : 42.0  
##  1st Qu.: 50.0  
##  Median : 72.0  
##  Mean   :110.1  
##  3rd Qu.:138.5  
##  Max.   :368.0  
## 
## mining info:
##  data ntransactions support confidence
##  Data          9835   0.004        0.3
##                                                                          call
##  apriori(data = Data, parameter = list(supp = 0.004, conf = 0.3, maxlen = 3))
inspect(sort(rules, by = 'lift')[1:10])
##      lhs                                      
## [1]  {bags,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,}  =>
## [2]  {shopping}                             =>
## [3]  {water,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} =>
## [4]  {juice,long}                           =>
## [5]  {juice,long}                           =>
## [6]  {bakery}                               =>
## [7]  {life}                                 =>
## [8]  {bakery, juice,long}                   =>
## [9]  {juice,long, life}                     =>
## [10] {bakery, vegetables,whole}             =>
##      rhs                                   support     confidence coverage   
## [1]  {shopping}                            0.004880529 0.96       0.005083884
## [2]  {bags,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} 0.004880529 1.00       0.004880529
## [3]  {bottled}                             0.006812405 1.00       0.006812405
## [4]  {bakery}                              0.004270463 1.00       0.004270463
## [5]  {life}                                0.004270463 1.00       0.004270463
## [6]  {life}                                0.037417387 1.00       0.037417387
## [7]  {bakery}                              0.037417387 1.00       0.037417387
## [8]  {life}                                0.004270463 1.00       0.004270463
## [9]  {bakery}                              0.004270463 1.00       0.004270463
## [10] {life}                                0.006609049 1.00       0.006609049
##      lift      count
## [1]  196.70000  48  
## [2]  196.70000  48  
## [3]   28.26149  67  
## [4]   26.72554  42  
## [5]   26.72554  42  
## [6]   26.72554 368  
## [7]   26.72554 368  
## [8]   26.72554  42  
## [9]   26.72554  42  
## [10]  26.72554  65
# top 10 by confidence
inspect(head(rules, n = 10, by = "confidence"))
##      lhs                                      
## [1]  {shopping}                             =>
## [2]  {water,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} =>
## [3]  {juice,long}                           =>
## [4]  {juice,long}                           =>
## [5]  {cream,cream}                          =>
## [6]  {milk,cream}                           =>
## [7]  {,frozen}                              =>
## [8]  {cheese,cream}                         =>
## [9]  {bakery}                               =>
## [10] {life}                                 =>
##      rhs                                   support     confidence coverage   
## [1]  {bags,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} 0.004880529 1          0.004880529
## [2]  {bottled}                             0.006812405 1          0.006812405
## [3]  {bakery}                              0.004270463 1          0.004270463
## [4]  {life}                                0.004270463 1          0.004270463
## [5]  {cheese}                              0.004473818 1          0.004473818
## [6]  {cheese}                              0.006914082 1          0.006914082
## [7]  {cheese}                              0.004677173 1          0.004677173
## [8]  {cheese}                              0.005287239 1          0.005287239
## [9]  {life}                                0.037417387 1          0.037417387
## [10] {bakery}                              0.037417387 1          0.037417387
##      lift      count
## [1]  196.70000  48  
## [2]   28.26149  67  
## [3]   26.72554  42  
## [4]   26.72554  42  
## [5]   25.21795  44  
## [6]   25.21795  68  
## [7]   25.21795  46  
## [8]   25.21795  52  
## [9]   26.72554 368  
## [10]  26.72554 368
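
The quality measures in the tables above can be reproduced by hand. The sketch below recomputes support, confidence and lift for the first rule in the lift-sorted table, using counts recovered from the reported output: the rule's count (48), its coverage times the number of transactions (50), and the coverage of the reverse rule times the number of transactions (48). The variable names are illustrative.

```r
n          <- 9835  # total transactions
count_lhs  <- 50    # transactions containing the lhs (coverage * n)
count_rhs  <- 48    # transactions containing the rhs (coverage of the reverse rule * n)
count_both <- 48    # transactions containing lhs and rhs (the rule's count)

support    <- count_both / n              # P(lhs and rhs)
confidence <- count_both / count_lhs      # P(rhs | lhs)
lift       <- confidence / (count_rhs / n)  # confidence relative to P(rhs)

round(c(support = support, confidence = confidence, lift = lift), 6)
```

These values agree with the reported support of 0.004880529, confidence of 0.96 and lift of 196.7.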

Cluster analysis

Finally, we run a simple cluster analysis on the items in the transaction data that have greater than 4% support. We follow the logic and code example given in the help file for the dissimilarity function in the arules library.

# follow logic of "dissimilarity" help file from "arules" package
# cluster analysis on items with support > 4%
Data1 <- Data[ , itemFrequency(Data) > 0.04]
d_jaccard <- dissimilarity(Data1, which = "items")
# plot dendrogram
plot(hclust(d_jaccard, method = "ward.D2"), 
     main = "Dendrogram for items", sub = "", xlab = "")
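
To make the dissimilarity step concrete, here is a minimal base R sketch of Jaccard distance between items followed by the same hierarchical clustering call. The incidence matrix and the helper jaccard_dist are hypothetical and written for illustration; the sketch is not the arules implementation.

```r
# Hypothetical incidence matrix: rows = transactions, columns = items
m <- matrix(c(1, 1, 0,
              1, 1, 0,
              1, 0, 0,
              0, 0, 1,
              0, 1, 1,
              0, 0, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("milk", "bread", "beer")))

# Jaccard distance: 1 - (transactions with both) / (transactions with either)
jaccard_dist <- function(a, b) {
  both   <- sum(a == 1 & b == 1)
  either <- sum(a == 1 | b == 1)
  1 - both / either
}

# Pairwise distances between item columns
items <- colnames(m)
d <- outer(seq_along(items), seq_along(items),
           Vectorize(function(i, j) jaccard_dist(m[, i], m[, j])))
dimnames(d) <- list(items, items)

# Same clustering method as in the report, applied to the toy distances
hc <- hclust(as.dist(d), method = "ward.D2")
groups <- cutree(hc, k = 2)
groups
```

In this toy data, "milk" and "bread" share two of the four transactions that contain either one, giving a distance of 0.5, so they merge first and fall into the same cluster when the tree is cut into two groups.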