Prompt

Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’.

That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.

Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.

Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 3 before midnight.

# Import required R libraries
#library(tidyverse)
library(arules)
library(arulesViz)

# Set seed for assignment
set.seed(200)

The arules package (https://www.rdocumentation.org/packages/arules/versions/1.7-1) “provides the infrastructure for representing, manipulating and analyzing transaction data and patterns using frequent itemsets and association rules.” The arules library contains the data structure definitions and mining algorithms - APRIORI and ECLAT.

The arulesViz library provides visualizations for the association rules.

Exploratory Data Analysis

Following the example at http://r-statistics.co/Association-Mining-With-R.html, I read the provided CSV file into a transactions object from the arules package.

# http://r-statistics.co/Association-Mining-With-R.html
grocery_ds <- read.transactions("GroceryDataSet.csv", sep=",")
class(grocery_ds)

## [1] "transactions"
## attr(,"package")
## [1] "arules"

The class() function confirms that grocery_ds is a transactions object from the arules package.

summary(grocery_ds)

## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics

The summary of the grocery data set shows 9835 rows (individual receipts, i.e. transactions) and 169 columns (unique items). The most frequent items are whole milk, other vegetables, rolls/buns, soda, and yogurt. The median number of items per transaction is 3 and the mean is about 4.4. The minimum is 1, as expected, and the largest transaction contains 32 items.
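The density and mean basket size in the summary can be cross-checked with a little arithmetic. The sketch below uses only counts copied from the summary output. The variable names are my own, and it assumes density means filled cells divided by all cells of the transaction-by-item matrix.

```r
# Counts copied from the summary() output above
n_transactions <- 9835
n_items        <- 169
item_counts    <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top five items plus "(Other)"

total_purchases <- sum(item_counts)                        # filled cells in the sparse matrix
density    <- total_purchases / (n_transactions * n_items) # share of filled cells
mean_items <- total_purchases / n_transactions             # average items per receipt

print(round(density, 8))     # compare with the reported density
print(round(mean_items, 3))  # compare with the reported mean
```

Both values should line up with the density and mean printed by summary(), which is a quick sanity check that the file was parsed as one receipt per line.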

size(head(grocery_ds))

## [1] 4 3 1 4 4 5

LIST(head(grocery_ds, 3))

## [[1]]
## [1] "citrus fruit"        "margarine"           "ready soups"        
## [4] "semi-finished bread"
## 
## [[2]]
## [1] "coffee"         "tropical fruit" "yogurt"        
## 
## [[3]]
## [1] "whole milk"

inspect(head(grocery_ds, 3))

##     items                
## [1] {citrus fruit,       
##      margarine,          
##      ready soups,        
##      semi-finished bread}
## [2] {coffee,             
##      tropical fruit,     
##      yogurt}             
## [3] {whole milk}

The size() function confirms the item count per transaction, and the LIST() function shows the first three transactions, confirming that the data object meets expectations. The results above match the raw CSV file. The inspect() function appears to return the same information as LIST(), but with a cleaner output presentation.

Support Evaluation

# calculates 'support' of the frequent items in the dataset
support_val <- 0.07
frequentItems <- eclat(grocery_ds, parameter = list(supp=support_val, maxlen=15))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.07      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 688 
## 
## create itemset ... 
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating sparse bit matrix ... [18 row(s), 9835 column(s)] done [0.00s].
## writing  ... [19 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

The eclat() function “finds frequent item sets with the Eclat algorithm, which carries out a depth first search on the subset lattice and determines the support of item sets by intersecting transaction lists.” (https://borgelt.net/eclat.html) The parameter supp represents support, which is defined as the proportion of transactions in the dataset that contain the itemset. The value passed is the minimum threshold an itemset must reach to be reported.
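To make the definition of support concrete, here is a minimal base-R sketch. The toy baskets and the helper itemset_support() are hypothetical illustrations, not part of arules or of the assignment data. The last line only reuses a count and a total that eclat() reports for this data set.

```r
# Toy baskets (hypothetical data, not taken from GroceryDataSet.csv)
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "other vegetables"),
  c("soda"),
  c("whole milk", "other vegetables", "yogurt")
)

# Support of an itemset = share of baskets that contain every item in it
itemset_support <- function(itemset, baskets) {
  contains_all <- vapply(baskets, function(b) all(itemset %in% b), logical(1))
  mean(contains_all)
}

support_milk <- itemset_support("whole milk", baskets)                         # 3 of 4 baskets
support_pair <- itemset_support(c("whole milk", "other vegetables"), baskets)  # 2 of 4 baskets

# Same idea with the real numbers: 736 of 9835 receipts hold the milk/vegetables pair
reported_pair_support <- 736 / 9835
```

An itemset is reported by eclat() only when this proportion reaches the supp threshold, which is why so few itemsets survive a cutoff of 0.07.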

inspect(frequentItems)

##      items                          support    count
## [1]  {other vegetables, whole milk} 0.07483477  736 
## [2]  {whole milk}                   0.25551601 2513 
## [3]  {other vegetables}             0.19349263 1903 
## [4]  {rolls/buns}                   0.18393493 1809 
## [5]  {yogurt}                       0.13950178 1372 
## [6]  {soda}                         0.17437722 1715 
## [7]  {root vegetables}              0.10899847 1072 
## [8]  {tropical fruit}               0.10493137 1032 
## [9]  {bottled water}                0.11052364 1087 
## [10] {sausage}                      0.09395018  924 
## [11] {shopping bags}                0.09852567  969 
## [12] {citrus fruit}                 0.08276563  814 
## [13] {pastry}                       0.08896797  875 
## [14] {pip fruit}                    0.07564820  744 
## [15] {whipped/sour cream}           0.07168277  705 
## [16] {fruit/vegetable juice}        0.07229283  711 
## [17] {newspapers}                   0.07981698  785 
## [18] {bottled beer}                 0.08052872  792 
## [19] {canned beer}                  0.07768175  764

With a minimum support of 0.07, the output reports 19 frequent itemsets, only one of which is a pair: whole milk and other vegetables appear together in about 7.5% of transactions. The result makes sense given those two items are the most frequent individually and are grocery staples.

itemFrequencyPlot(grocery_ds, topN=10, type="absolute", main="Item Frequency")

The plot above displays a count of the 10 most frequent items with whole milk and other vegetables occurring most often, matching the results of the summary() function above.

Generate Association Rules

Confidence Evaluation

# Define minimum support
supp_val <- 0.001
# Define minimum confidence (increase to get stronger rules)
conf_val <- 0.9
# Increase maxlen to get longer rules
maxlen_val <- 5
rules <- apriori(grocery_ds, parameter=list(supp=supp_val, conf=conf_val, maxlen=maxlen_val))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [123 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_conf <- sort(rules, by="confidence", decreasing=TRUE)

The apriori() function “finds association rules and frequent item sets with the Apriori algorithm, which carries out a breadth first search on the subset lattice and determines the support of item sets by subset tests.” (https://borgelt.net/apriori.html)

inspect(head(rules_conf, 10))

##      lhs                     rhs                    support confidence    coverage     lift count
## [1]  {rice,                                                                                      
##       sugar}              => {whole milk}       0.001220132          1 0.001220132 3.913649    12
## [2]  {canned fish,                                                                               
##       hygiene articles}   => {whole milk}       0.001118454          1 0.001118454 3.913649    11
## [3]  {butter,                                                                                    
##       rice,                                                                                      
##       root vegetables}    => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [4]  {flour,                                                                                     
##       root vegetables,                                                                           
##       whipped/sour cream} => {whole milk}       0.001728521          1 0.001728521 3.913649    17
## [5]  {butter,                                                                                    
##       domestic eggs,                                                                             
##       soft cheese}        => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [6]  {citrus fruit,                                                                              
##       root vegetables,                                                                           
##       soft cheese}        => {other vegetables} 0.001016777          1 0.001016777 5.168156    10
## [7]  {butter,                                                                                    
##       hygiene articles,                                                                          
##       pip fruit}          => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [8]  {hygiene articles,                                                                          
##       root vegetables,                                                                           
##       whipped/sour cream} => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [9]  {hygiene articles,                                                                          
##       pip fruit,                                                                                 
##       root vegetables}    => {whole milk}       0.001016777          1 0.001016777 3.913649    10
## [10] {cream cheese,                                                                              
##       domestic eggs,                                                                             
##       sugar}              => {whole milk}       0.001118454          1 0.001118454 3.913649    11

With a minimum support of 0.001 and a minimum confidence of 0.9, the output above shows the 10 highest-confidence association rules, all with a confidence of 1. A confidence of 1 indicates that the item on the right-hand side occurred in every transaction that contained the left-hand-side items, although each of these rules rests on only 10 to 17 transactions. Not surprisingly, 9 of the 10 rules have whole milk on the right-hand side, and the remaining rule has other vegetables. Given the high frequency of those two items, these results are expected.
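Confidence can be recovered from the other columns of the output as support divided by coverage, where coverage is the support of the left-hand side alone. The sketch below is a hedged illustration in base R. The counts are read off the inspect() output in this report, and the variable names are my own.

```r
n_transactions <- 9835

# Rule [1] above: {rice, sugar} => {whole milk}
rule_count <- 12  # receipts with rice, sugar and whole milk ("count" column)
lhs_count  <- 12  # receipts with rice and sugar (coverage * n_transactions)

rule_support <- rule_count / n_transactions
coverage     <- lhs_count / n_transactions
confidence   <- rule_support / coverage   # every rice-and-sugar receipt also has whole milk

# Contrast: the {liquor, red/blush wine} => {bottled beer} rule in the
# Lift Evaluation output has count 19 and a coverage of 21 receipts
confidence_beer <- 19 / 21

print(confidence)
print(round(confidence_beer, 7))
```

Whenever support and coverage are equal, as in all ten rules above, the confidence is exactly 1.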

Lift Evaluation

rules_lift <- sort(rules, by="lift", decreasing=TRUE)
inspect(head(rules_lift, 10))

##      lhs                         rhs                    support confidence    coverage      lift count
## [1]  {liquor,                                                                                         
##       red/blush wine}         => {bottled beer}     0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {citrus fruit,                                                                                   
##       fruit/vegetable juice,                                                                          
##       other vegetables,                                                                               
##       soda}                   => {root vegetables}  0.001016777  0.9090909 0.001118454  8.340400    10
## [3]  {butter,                                                                                         
##       cream cheese,                                                                                   
##       root vegetables}        => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [4]  {butter,                                                                                         
##       sliced cheese,                                                                                  
##       tropical fruit,                                                                                 
##       whole milk}             => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [5]  {cream cheese,                                                                                   
##       curd,                                                                                           
##       other vegetables,                                                                               
##       whipped/sour cream}     => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [6]  {butter,                                                                                         
##       other vegetables,                                                                               
##       tropical fruit,                                                                                 
##       white bread}            => {yogurt}           0.001016777  0.9090909 0.001118454  6.516698    10
## [7]  {citrus fruit,                                                                                   
##       root vegetables,                                                                                
##       soft cheese}            => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10
## [8]  {brown bread,                                                                                    
##       pip fruit,                                                                                      
##       whipped/sour cream}     => {other vegetables} 0.001118454  1.0000000 0.001118454  5.168156    11
## [9]  {grapes,                                                                                         
##       tropical fruit,                                                                                 
##       whole milk,                                                                                     
##       yogurt}                 => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10
## [10] {ham,                                                                                            
##       pip fruit,                                                                                      
##       tropical fruit,                                                                                 
##       yogurt}                 => {other vegetables} 0.001016777  1.0000000 0.001016777  5.168156    10

Using the same support and confidence thresholds, and therefore the same set of rules, the rules are now sorted by lift in order to find the 10 rules with the highest lift. The lift value indicates “the deviation of the support of the whole rule from the support expected under independence given the supports of the LHS and the RHS.” (https://cran.r-project.org/web/packages/arules/vignettes/arules.pdf) The higher the lift value, the stronger the association between the LHS and the RHS.
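Equivalently, lift is the rule's confidence divided by the support of the RHS item, which makes the reported values easy to reproduce by hand. The base-R sketch below uses a hypothetical helper lift() (not an arules function) with counts and confidences copied from the outputs in this report.

```r
n_transactions <- 9835

# RHS supports, from the item counts reported earlier
supp_whole_milk   <- 2513 / n_transactions
supp_other_veg    <- 1903 / n_transactions
supp_bottled_beer <-  792 / n_transactions

# Lift = confidence of the rule / support of the RHS
lift <- function(confidence, rhs_support) confidence / rhs_support

lift_milk <- lift(1, supp_whole_milk)                 # rules with whole milk on the RHS
lift_veg  <- lift(1, supp_other_veg)                  # rules with other vegetables on the RHS
lift_beer <- lift(0.9047619, supp_bottled_beer)       # {liquor, red/blush wine} => {bottled beer}

print(round(c(lift_milk, lift_veg, lift_beer), 4))
```

This also explains why bottled beer tops the list: its RHS support is much lower than that of whole milk, so a rule of similar confidence earns a much higher lift.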

Note: I chose a minimum confidence of 0.9 so that some high-lift rules would appear that did not have whole milk or other vegetables on the right-hand side. The top lift result (11.235269) for these parameter values has an LHS of liquor and red/blush wine and an RHS of bottled beer. That association makes sense for a shopper stocking up on alcohol. The rules with an RHS of yogurt show different LHS items, including butter, cream cheese, curd, and whipped/sour cream, which seems valid given those items are typically in the same refrigerated section of a grocery store. Overall, none of the above rules stands out like the infamous diapers-to-beer association.

# Could I find the diapers => beer rule
supp_baby_val <- 0.001
conf_baby_val <- 0.05
rules_baby <- apriori(grocery_ds,
                      parameter  = list(supp=supp_baby_val, conf=conf_baby_val),
                      appearance = list(default="rhs", lhs="baby food"),
                      control    = list(verbose=FALSE))

rules_baby_conf <- sort(rules_baby, by="confidence", decreasing=TRUE)
inspect(head(rules_baby_conf))

##     lhs    rhs                support   confidence coverage lift count
## [1] {}  => {whole milk}       0.2555160 0.2555160  1        1    2513 
## [2] {}  => {other vegetables} 0.1934926 0.1934926  1        1    1903 
## [3] {}  => {rolls/buns}       0.1839349 0.1839349  1        1    1809 
## [4] {}  => {soda}             0.1743772 0.1743772  1        1    1715 
## [5] {}  => {yogurt}           0.1395018 0.1395018  1        1    1372 
## [6] {}  => {bottled water}    0.1105236 0.1105236  1        1    1087

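Nope, couldn’t find the baby product to beer rule. The six rules shown all have an empty LHS and a lift of 1, so they simply restate each item’s overall frequency rather than anything tied to “baby food”. Using “baby cosmetics” as the LHS didn’t produce any meaningful association either.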
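One way to dig into this is to check how often the baby items occur at all and to exclude empty-LHS rules by setting minlen = 2. The sketch below is an illustration under assumptions: it uses the Groceries data set bundled with arules as a stand-in for GroceryDataSet.csv (the two may not be identical), and the object names baby_counts and rules_baby2 are made up for this example.

```r
library(arules)

# Stand-in data: the Groceries transactions bundled with arules
# (assumed, not verified, to match GroceryDataSet.csv)
data("Groceries")

# How many baskets contain each baby item? An item that falls short of the
# minimum support on its own cannot appear in any rule.
baby_counts <- itemFrequency(Groceries, type = "absolute")[c("baby food", "baby cosmetics")]
print(baby_counts)

# minlen = 2 drops rules with an empty LHS such as {} => {whole milk}
rules_baby2 <- apriori(Groceries,
                       parameter  = list(supp = 0.001, conf = 0.05, minlen = 2),
                       appearance = list(lhs = "baby cosmetics", default = "rhs"),
                       control    = list(verbose = FALSE))
length(rules_baby2)  # number of rules whose LHS is baby cosmetics
```

If the counts turn out to be below the minimum support count, lowering supp is the only way to surface such rules, at the cost of basing them on a handful of receipts.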

Visualizations

I used the plotting functions from the library arulesViz to help understand the association rules through visualizations.

# https://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf
options(digits = 2)
plot(rules)

The scatterplot above shows the relationship between support and confidence for the 123 association rules generated with a minimum support of 0.001 and a minimum confidence of 0.9.

plot(rules, measure=c("support", "lift"), shading="confidence")

The scatterplot above shows the relationship between support and lift for the same 123 association rules (minimum support of 0.001, minimum confidence of 0.9), with shading indicating confidence.

plot(rules, method="two-key plot")

The two-key plot above shows the relationship between support and confidence for the same 123 association rules (minimum support of 0.001, minimum confidence of 0.9), where order denotes the number of items in the rule.

plot(rules, method="grouped", control = list(k = 10))

The grouped matrix-based visualization above uses a balloon plot to show the LHS groups as columns and the RHS items as rows. The color of each balloon shows the aggregated interest measure and its size shows the aggregated support.

subrules2 <- head(rules, n=10, by="lift")
plot(subrules2, method="graph")

The above graph-based visualization shows the items and rules as vertices and connections with directed edges. The plot helps identify which rules share items.

Conclusion

Overall, the market basket analysis proved straightforward with the arules package. Teasing out more “interesting” associations would require further adjustment of the support and confidence thresholds to find rules with few occurrences but high confidence. Too bad this dataset didn’t have the diapers-to-beer connection.

DATA 624 Assignment 10

CUNY Fall 2021

Philip Tanofsky

05 December 2021