Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. In other words, a receipt records what went into a customer’s basket, hence the name ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
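To make the basket format concrete, here is a minimal sketch that writes three made-up receipts to a temporary file and reads them back with `read.transactions`. The receipts and the `toy_file` and `toy` names are hypothetical and are not taken from the Groceries file.

```r
library(arules)

# Three made-up receipts in basket format: one transaction per line,
# items separated by commas (hypothetical data, not from the Groceries file).
toy_file <- tempfile(fileext = ".csv")
writeLines(c("whole milk,yogurt",
             "rolls/buns",
             "whole milk,rolls/buns,soda"),
           toy_file)

# Same reader and arguments as used for the real data set.
toy <- read.transactions(toy_file, format = "basket", sep = ",")

inspect(toy)  # prints one itemset per receipt
size(toy)     # number of items on each receipt
```

Each line of the file becomes one transaction, and the set of distinct item names across all lines becomes the columns of the sparse item matrix.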

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
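For reference, the three measures can be written as follows. This uses the standard definitions, where T is the set of transactions and X and Y are disjoint itemsets; the notation is introduced here for illustration and does not come from the assignment text.

```latex
\[
\operatorname{supp}(X) = \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|},
\qquad
\operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)},
\qquad
\operatorname{lift}(X \Rightarrow Y) = \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
\]
```

The support reported for a rule X => Y is supp(X ∪ Y), the share of transactions containing every item in the rule. A lift above 1 means Y appears more often in baskets containing X than in baskets overall.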

Import Libraries and dataset

library(arules)
## Warning: package 'arules' was built under R version 4.2.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.2.3
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(readxl)
## Warning: package 'readxl' was built under R version 4.2.3
grocery <- read.transactions("H:/Brian/CUNY/Spring 2024/Data 624/HW10/GroceryDataSet.csv",format = "basket", sep = ",")

Look at a summary of the data

summary(grocery)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

We can see that whole milk, other vegetables, rolls/buns, soda and yogurt are the most frequently purchased items.

Look at the distribution of item frequencies for the 25 most frequent items, first as relative frequencies and then as absolute counts.

itemFrequencyPlot(grocery, topN = 25)

itemFrequencyPlot(grocery, topN = 25, type="absolute")

We will develop association rules, which describe relationships between items in the transactional data, using the apriori function to identify patterns and associations among the grocery items.

The relative item frequencies plotted above top out at about 0.26 (whole milk, 2513 of 9835 transactions). We want to capture enough items to find associations, so I set the minimum support to 0.01 and the minimum confidence to 0.5. (I also experimented with a support of 0.2, which returned no rules.)
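The effect of the support threshold can be checked with a little arithmetic. The sketch below uses only the counts printed by `summary(grocery)`; the variable names are illustrative.

```r
# Counts copied from the summary(grocery) output above.
n_transactions <- 9835
top_counts <- c("whole milk" = 2513, "other vegetables" = 1903,
                "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)

# Relative frequency (single-item support) of the five most frequent items.
rel_freq <- top_counts / n_transactions
round(rel_freq, 3)

# Items that would survive a minimum support of 0.2.
names(rel_freq)[rel_freq >= 0.2]

# Number of transactions that a minimum support of 0.01 corresponds to.
0.01 * n_transactions
```

Only whole milk reaches a support of 0.2. An itemset can never be more frequent than its rarest item, so no combination of two or more items can reach 0.2 either. That is consistent with the empty result at that threshold, since a rule with whole milk alone on the right-hand side would have a confidence of about 0.26, below the 0.5 cutoff. At 0.01, an itemset needs to appear in roughly 98 or more transactions, which matches the absolute minimum support count reported by apriori.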

association <- apriori(grocery, parameter = list(support = 0.01, confidence = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 98 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules <- head(sort(association, by = "lift"), 10)
summary(rules)
## set of 10 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 10 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01007   Min.   :0.5000   Min.   :0.01729   Min.   :2.053  
##  1st Qu.:0.01103   1st Qu.:0.5315   1st Qu.:0.02021   1st Qu.:2.210  
##  Median :0.01210   Median :0.5665   Median :0.02105   Median :2.262  
##  Mean   :0.01191   Mean   :0.5539   Mean   :0.02161   Mean   :2.440  
##  3rd Qu.:0.01230   3rd Qu.:0.5802   3rd Qu.:0.02379   3rd Qu.:2.592  
##  Max.   :0.01454   Max.   :0.5862   Max.   :0.02583   Max.   :3.030  
##      count      
##  Min.   : 99.0  
##  1st Qu.:108.5  
##  Median :119.0  
##  Mean   :117.1  
##  3rd Qu.:121.0  
##  Max.   :143.0  
## 
## mining info:
##     data ntransactions support confidence
##  grocery          9835    0.01        0.5
##                                                                         call
##  apriori(data = grocery, parameter = list(support = 0.01, confidence = 0.5))
inspect(rules)
##      lhs                                  rhs                support   
## [1]  {citrus fruit, root vegetables}   => {other vegetables} 0.01037112
## [2]  {root vegetables, tropical fruit} => {other vegetables} 0.01230300
## [3]  {rolls/buns, root vegetables}     => {other vegetables} 0.01220132
## [4]  {root vegetables, yogurt}         => {other vegetables} 0.01291307
## [5]  {curd, yogurt}                    => {whole milk}       0.01006609
## [6]  {butter, other vegetables}        => {whole milk}       0.01148958
## [7]  {root vegetables, tropical fruit} => {whole milk}       0.01199797
## [8]  {root vegetables, yogurt}         => {whole milk}       0.01453991
## [9]  {domestic eggs, other vegetables} => {whole milk}       0.01230300
## [10] {whipped/sour cream, yogurt}      => {whole milk}       0.01087951
##      confidence coverage   lift     count
## [1]  0.5862069  0.01769192 3.029608 102  
## [2]  0.5845411  0.02104728 3.020999 121  
## [3]  0.5020921  0.02430097 2.594890 120  
## [4]  0.5000000  0.02582613 2.584078 127  
## [5]  0.5823529  0.01728521 2.279125  99  
## [6]  0.5736041  0.02003050 2.244885 113  
## [7]  0.5700483  0.02104728 2.230969 118  
## [8]  0.5629921  0.02582613 2.203354 143  
## [9]  0.5525114  0.02226741 2.162336 121  
## [10] 0.5245098  0.02074225 2.052747 107

15 rules were generated; the top 10 by lift are listed above. The rule {citrus fruit, root vegetables} => {other vegetables} has both the highest lift (3.03) and the highest confidence (0.586): about 59% of the transactions containing citrus fruit and root vegetables also contain other vegetables, roughly three times the rate at which other vegetables appear across all transactions. Among the listed rules that predict whole milk, {curd, yogurt} has the highest lift (2.28) and confidence (0.582). The rule with the highest support, however, is {root vegetables, yogurt} => {whole milk}, whose three items appear together in 143 transactions (support 0.0145).
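As a sanity check, the lift values can be reproduced by hand from numbers already printed above, using the standard relation lift = confidence / support of the right-hand side. This is a minimal sketch with illustrative variable names.

```r
n_transactions <- 9835

# Rule [1]: {citrus fruit, root vegetables} => {other vegetables}
conf_rule1 <- 0.5862069                   # confidence from inspect(rules)
supp_other_veg <- 1903 / n_transactions   # item count from summary(grocery)
lift_rule1 <- conf_rule1 / supp_other_veg
round(lift_rule1, 3)

# Rule [5]: {curd, yogurt} => {whole milk}
conf_rule5 <- 0.5823529
supp_whole_milk <- 2513 / n_transactions
lift_rule5 <- conf_rule5 / supp_whole_milk
round(lift_rule5, 3)
```

Both results agree with the lift column of the `inspect(rules)` output. The whole milk rules have lower lift even at similar confidence because whole milk is already in about a quarter of all baskets.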

Use the arulesViz package to visualize the associations and their lift and support.

library(arulesViz)
# The `type` control parameter is not recognized by the installed arulesViz
# version (it triggered an "Unknown control parameters: type" warning), so it is omitted.
plot(association, method = "graph")
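A scatter plot is another way to view support and lift together. The sketch below is self-contained and therefore assumes that the `Groceries` data set bundled with arules can stand in for the CSV file. The `rules_demo` name is illustrative, and the thresholds mirror the ones used above.

```r
library(arules)
library(arulesViz)

# Assumption: the Groceries data bundled with arules stands in for the CSV file.
data("Groceries")

# Same thresholds as above; verbose = FALSE suppresses the progress log.
rules_demo <- apriori(Groceries,
                      parameter = list(support = 0.01, confidence = 0.5),
                      control = list(verbose = FALSE))

# Scatter plot: support on the x-axis, lift on the y-axis, confidence as shading.
plot(rules_demo, measure = c("support", "lift"), shading = "confidence")
```

With only a handful of rules the plot is sparse, but it makes it easy to spot rules that combine relatively high support with high lift.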