Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. In other words, a receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased on it. Each line is called a transaction, and each comma-separated field in a line is an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
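
For reference, the three measures the assignment asks for can be written as follows. This is the standard textbook formulation, not something stated in the assignment itself; the symbols $X$ (left-hand side), $Y$ (right-hand side), $T$ (the set of transactions) and $N$ (the number of transactions) are notation introduced here for illustration.

```latex
\begin{align*}
\operatorname{supp}(X \Rightarrow Y) &= \frac{\lvert \{\, t \in T : X \cup Y \subseteq t \,\} \rvert}{N} \\[4pt]
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\[4pt]
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
\end{align*}
```

A lift above 1 means the items on both sides appear together more often than they would if they were purchased independently.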

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
# install.packages("Matrix")
# Load the packages
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
library(cluster)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
transaction <- read.transactions('C://Users//Bikash_Bhowmik//Downloads//GroceryDataSet.csv', sep = ',', header = FALSE)
head(transaction)
## transactions in sparse format with
##  6 transactions (rows) and
##  169 items (columns)
str(transaction)
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   .. .. ..@ i       : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
##   .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
##   .. .. ..@ Dim     : int [1:2] 169 9835
##   .. .. ..@ Dimnames:List of 2
##   .. .. .. ..$ : NULL
##   .. .. .. ..$ : NULL
##   .. .. ..@ factors : list()
##   ..@ itemInfo   :'data.frame':  169 obs. of  1 variable:
##   .. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
##   ..@ itemsetInfo:'data.frame':  0 obs. of  0 variables

Next, I look at a summary of the data.

summary(transaction)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

I can see that whole milk, other vegetables, rolls/buns, soda and yogurt are the most purchased items.
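
As a sanity check, the density and mean basket size in the summary can be reproduced from the raw counts shown in the output above. The sketch below uses base R only; the variable names are my own and the numbers are copied from the `str()` and `summary()` output rather than read from the file.

```r
# Counts copied from the str() and summary() output above.
n_transactions <- 9835
n_items        <- 169
n_entries      <- 43367  # length of the @i slot: total item occurrences

# The five most frequent items plus "(Other)" should add up to all entries.
top_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)
counts_add_up <- sum(top_counts) == n_entries

# Density: share of filled cells in the transaction-by-item matrix.
density <- n_entries / (n_transactions * n_items)

# Mean basket size: average number of items per receipt.
mean_basket <- n_entries / n_transactions

print(counts_add_up)
print(round(density, 8))
print(round(mean_basket, 3))
```

A density of about 2.6% means the typical receipt contains only a small fraction of the 169 possible items, which is why the data is stored in sparse format.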

Next, I plot the 25 most frequent items, first by relative frequency and then by absolute count.

itemFrequencyPlot(transaction, topN = 25)

itemFrequencyPlot(transaction, topN = 25, type="absolute")

Next, I use the apriori function to mine association rules, that is, relationships between items in the transaction data, in order to identify patterns among the grocery items.
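
To make the idea of a rule concrete before running apriori, here is a minimal base R sketch that scores a single rule by hand. The five baskets and the helper `support()` are hypothetical and made up for illustration; they are not taken from the Groceries file, and this is not how arules computes its measures internally.

```r
# Five made-up baskets (hypothetical data, not from the Groceries file).
baskets <- list(
  c("milk", "bread"),
  c("milk", "yogurt"),
  c("milk", "bread", "yogurt"),
  c("bread"),
  c("milk", "yogurt", "eggs")
)

# Fraction of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Score the rule {yogurt} => {milk}.
lhs <- "yogurt"
rhs <- "milk"

supp_rule <- support(c(lhs, rhs), baskets)       # both sides together
conf_rule <- supp_rule / support(lhs, baskets)   # given the left-hand side
lift_rule <- conf_rule / support(rhs, baskets)   # relative to the baseline

print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

In this toy data every basket with yogurt also has milk, so the confidence is 1, and because milk is in 80% of all baskets the lift is 1.25.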

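The relative item frequencies above top out at roughly 0.26 (whole milk, 2,513 of 9,835 transactions), and most items are far rarer. I want to keep enough items to find associations, so I set the minimum support to 0.01 and the minimum confidence to 0.5. (I also tried a support of 0.2, which returned no rules: whole milk is the only item that appears in at least 20% of transactions.)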
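
The choice of threshold can be checked against the item counts from the summary. The sketch below is base R only and uses the five counts printed by `summary(transaction)`; the variable names are my own.

```r
# Item counts copied from the "most frequent items" block of summary().
n_transactions <- 9835
item_counts <- c("whole milk"       = 2513,
                 "other vegetables" = 1903,
                 "rolls/buns"       = 1809,
                 "soda"             = 1715,
                 "yogurt"           = 1372)

# Relative support of each single item.
item_support <- item_counts / n_transactions
print(round(item_support, 3))

# Items that would count as frequent at each candidate threshold.
frequent_at_0.2  <- names(item_support)[item_support >= 0.2]
frequent_at_0.01 <- names(item_support)[item_support >= 0.01]

print(frequent_at_0.2)
print(frequent_at_0.01)
```

With a single frequent item there is nothing to pair it with, which is consistent with the empty result at a support of 0.2.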

association <- apriori(transaction, parameter = list(support = 0.01, confidence = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules <- head(sort(association, by = "lift"), 10)
summary(rules)
## set of 10 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 10 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01007   Min.   :0.5000   Min.   :0.01729   Min.   :2.053  
##  1st Qu.:0.01103   1st Qu.:0.5315   1st Qu.:0.02021   1st Qu.:2.210  
##  Median :0.01210   Median :0.5665   Median :0.02105   Median :2.262  
##  Mean   :0.01191   Mean   :0.5539   Mean   :0.02161   Mean   :2.440  
##  3rd Qu.:0.01230   3rd Qu.:0.5802   3rd Qu.:0.02379   3rd Qu.:2.592  
##  Max.   :0.01454   Max.   :0.5862   Max.   :0.02583   Max.   :3.030  
##      count      
##  Min.   : 99.0  
##  1st Qu.:108.5  
##  Median :119.0  
##  Mean   :117.1  
##  3rd Qu.:121.0  
##  Max.   :143.0  
## 
## mining info:
##         data ntransactions support confidence
##  transaction          9835    0.01        0.5
##                                                                             call
##  apriori(data = transaction, parameter = list(support = 0.01, confidence = 0.5))
inspect(rules)
##      lhs                                  rhs                support   
## [1]  {citrus fruit, root vegetables}   => {other vegetables} 0.01037112
## [2]  {root vegetables, tropical fruit} => {other vegetables} 0.01230300
## [3]  {rolls/buns, root vegetables}     => {other vegetables} 0.01220132
## [4]  {root vegetables, yogurt}         => {other vegetables} 0.01291307
## [5]  {curd, yogurt}                    => {whole milk}       0.01006609
## [6]  {butter, other vegetables}        => {whole milk}       0.01148958
## [7]  {root vegetables, tropical fruit} => {whole milk}       0.01199797
## [8]  {root vegetables, yogurt}         => {whole milk}       0.01453991
## [9]  {domestic eggs, other vegetables} => {whole milk}       0.01230300
## [10] {whipped/sour cream, yogurt}      => {whole milk}       0.01087951
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102  
## [2]  0.5845411  0.02104728 3.020999 121  
## [3]  0.5020921  0.02430097 2.594890 120  
## [4]  0.5000000  0.02582613 2.584078 127  
## [5]  0.5823529  0.01728521 2.279125  99  
## [6]  0.5736041  0.02003050 2.244885 113  
## [7]  0.5700483  0.02104728 2.230969 118  
## [8]  0.5629921  0.02582613 2.203354 143  
## [9]  0.5525114  0.02226741 2.162336 121  
## [10] 0.5245098  0.02074225 2.052747 107

The top 10 rules by lift, out of the 15 rules found, are listed above. The strongest rule is {citrus fruit, root vegetables} => {other vegetables}, with a lift of 3.03 and the highest confidence of the ten (0.586): about 59% of baskets containing citrus fruit and root vegetables also contain other vegetables, roughly three times the rate across all baskets. Among the rules that predict whole milk, {curd, yogurt} has the highest lift (2.28) and confidence (0.582). The rule with the highest support is {root vegetables, yogurt} => {whole milk}, which occurs in 143 transactions (support 0.0145).
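
The measures for rule [1] can be reproduced by hand from numbers already printed above, which is a useful check on how to read the table. This is a base R sketch with variable names of my own; the inputs are the rule's count and coverage from `inspect(rules)` and the count of other vegetables from `summary(transaction)`.

```r
# Inputs copied from the output above.
n_transactions <- 9835
count_rule     <- 102         # count column, rule [1]
coverage_lhs   <- 0.01769192  # coverage column, rule [1]: support of the left-hand side
count_rhs      <- 1903        # "other vegetables" in summary()

# Support: share of all transactions containing every item in the rule.
support_rule <- count_rule / n_transactions

# Confidence: share of left-hand-side baskets that also contain the right-hand side.
confidence_rule <- support_rule / coverage_lhs

# Lift: confidence relative to the baseline frequency of the right-hand side.
lift_rule <- confidence_rule / (count_rhs / n_transactions)

print(round(c(support = support_rule,
              confidence = confidence_rule,
              lift = lift_rule), 4))
```

The results agree with the support, confidence and lift columns for rule [1] up to rounding.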

Finally, I use the arulesViz package to visualize the associations along with their lift and support.

library(arulesViz)
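# Note: `type` is not among the control parameters listed below, so it is likely
# ignored here; arulesViz appears to print this listing for an unrecognized option.
plot(association, method = "graph", control = list(type = "items"))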
## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE