Market Basket

I’ll follow the guidelines of this page:

https://www.kirenz.com/post/2020-05-14-r-association-rule-mining/#association-rules

Rather than converting the CSV into a list of character vectors as the article does, I will read it directly with read.transactions() from arules.
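
For readers unfamiliar with the basket format that read.transactions() expects, here is a minimal sketch. The three-line file is made up for illustration and is not the grocery data; each line is one transaction, with items separated by commas.

```r
library(arules)

# Hypothetical toy file: one transaction per line, items separated by commas
toy_file <- tempfile(fileext = ".csv")
writeLines(c("whole milk,yogurt",
             "rolls/buns",
             "whole milk,rolls/buns,soda"),
           toy_file)

# format = "basket" is the default; it is spelled out here for clarity
toy <- read.transactions(toy_file, format = "basket", sep = ",")

dim(toy)         # number of transactions, number of distinct items
itemLabels(toy)  # the distinct item names
size(toy)        # number of items in each transaction
```

The same call shape is used on the real file; only the path differs.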

library(arules)

groceries <- read.transactions("/Users/harris/ds_masters/DATA624/market_basket/GroceryDataSet.csv", sep = ",")

dim(groceries)
## [1] 9835  169
itemLabels(groceries)
##   [1] "abrasive cleaner"          "artif. sweetener"         
##   [3] "baby cosmetics"            "baby food"                
##   [5] "bags"                      "baking powder"            
##   [7] "bathroom cleaner"          "beef"                     
##   [9] "berries"                   "beverages"                
##  [11] "bottled beer"              "bottled water"            
##  [13] "brandy"                    "brown bread"              
##  [15] "butter"                    "butter milk"              
##  [17] "cake bar"                  "candles"                  
##  [19] "candy"                     "canned beer"              
##  [21] "canned fish"               "canned fruit"             
##  [23] "canned vegetables"         "cat food"                 
##  [25] "cereals"                   "chewing gum"              
##  [27] "chicken"                   "chocolate"                
##  [29] "chocolate marshmallow"     "citrus fruit"             
##  [31] "cleaner"                   "cling film/bags"          
##  [33] "cocoa drinks"              "coffee"                   
##  [35] "condensed milk"            "cooking chocolate"        
##  [37] "cookware"                  "cream"                    
##  [39] "cream cheese"              "curd"                     
##  [41] "curd cheese"               "decalcifier"              
##  [43] "dental care"               "dessert"                  
##  [45] "detergent"                 "dish cleaner"             
##  [47] "dishes"                    "dog food"                 
##  [49] "domestic eggs"             "female sanitary products" 
##  [51] "finished products"         "fish"                     
##  [53] "flour"                     "flower (seeds)"           
##  [55] "flower soil/fertilizer"    "frankfurter"              
##  [57] "frozen chicken"            "frozen dessert"           
##  [59] "frozen fish"               "frozen fruits"            
##  [61] "frozen meals"              "frozen potato products"   
##  [63] "frozen vegetables"         "fruit/vegetable juice"    
##  [65] "grapes"                    "hair spray"               
##  [67] "ham"                       "hamburger meat"           
##  [69] "hard cheese"               "herbs"                    
##  [71] "honey"                     "house keeping products"   
##  [73] "hygiene articles"          "ice cream"                
##  [75] "instant coffee"            "Instant food products"    
##  [77] "jam"                       "ketchup"                  
##  [79] "kitchen towels"            "kitchen utensil"          
##  [81] "light bulbs"               "liqueur"                  
##  [83] "liquor"                    "liquor (appetizer)"       
##  [85] "liver loaf"                "long life bakery product" 
##  [87] "make up remover"           "male cosmetics"           
##  [89] "margarine"                 "mayonnaise"               
##  [91] "meat"                      "meat spreads"             
##  [93] "misc. beverages"           "mustard"                  
##  [95] "napkins"                   "newspapers"               
##  [97] "nut snack"                 "nuts/prunes"              
##  [99] "oil"                       "onions"                   
## [101] "organic products"          "organic sausage"          
## [103] "other vegetables"          "packaged fruit/vegetables"
## [105] "pasta"                     "pastry"                   
## [107] "pet care"                  "photo/film"               
## [109] "pickled vegetables"        "pip fruit"                
## [111] "popcorn"                   "pork"                     
## [113] "pot plants"                "potato products"          
## [115] "preservation products"     "processed cheese"         
## [117] "prosecco"                  "pudding powder"           
## [119] "ready soups"               "red/blush wine"           
## [121] "rice"                      "roll products"            
## [123] "rolls/buns"                "root vegetables"          
## [125] "rubbing alcohol"           "rum"                      
## [127] "salad dressing"            "salt"                     
## [129] "salty snack"               "sauces"                   
## [131] "sausage"                   "seasonal products"        
## [133] "semi-finished bread"       "shopping bags"            
## [135] "skin care"                 "sliced cheese"            
## [137] "snack products"            "soap"                     
## [139] "soda"                      "soft cheese"              
## [141] "softener"                  "sound storage medium"     
## [143] "soups"                     "sparkling wine"           
## [145] "specialty bar"             "specialty cheese"         
## [147] "specialty chocolate"       "specialty fat"            
## [149] "specialty vegetables"      "spices"                   
## [151] "spread cheese"             "sugar"                    
## [153] "sweet spreads"             "syrup"                    
## [155] "tea"                       "tidbits"                  
## [157] "toilet cleaner"            "tropical fruit"           
## [159] "turkey"                    "UHT-milk"                 
## [161] "vinegar"                   "waffles"                  
## [163] "whipped/sour cream"        "whisky"                   
## [165] "white bread"               "white wine"               
## [167] "whole milk"                "yogurt"                   
## [169] "zwieback"
summary(groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
image(groceries)

The itemLabels() and summary() functions show the unique item names, the row and column counts, the most frequent items, and the distribution of transaction sizes.
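
The density and mean transaction size in the summary are two views of the same total. As a rough check in base R, using only figures copied from the output above (the variable names are my own):

```r
# Figures copied from the summary() output above
n_transactions <- 9835
n_items        <- 169
top_counts     <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top five items plus (Other)

# Total number of item occurrences across all baskets
total_items <- sum(top_counts)

# Density: share of cells in the transaction-by-item matrix that are filled
density <- total_items / (n_transactions * n_items)

# Mean basket size: item occurrences per transaction
mean_basket <- total_items / n_transactions

density      # agrees with the reported 0.02609146 up to rounding
mean_basket  # agrees with the reported mean of 4.409 up to rounding
```

In other words, the average basket holds about 4.4 of the 169 items, which is why the matrix is so sparse.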

These data don’t display well with arules’ image() function: with 9,835 transactions against only 169 items, the resulting matrix image is far too tall and narrow to read.
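
One possible workaround, not used in this analysis, is to plot a random sample of transactions instead of the full matrix. The sketch below assumes the `Groceries` dataset shipped with arules as a stand-in for the CSV-based `groceries` object; the sample size of 100 and the seed are arbitrary choices.

```r
library(arules)

# Assumption: the built-in Groceries data stands in for the CSV read above
data("Groceries")

set.seed(624)                    # arbitrary seed, only for reproducibility
small <- sample(Groceries, 100)  # 100 randomly chosen transactions

dim(small)    # 100 rows, same item columns as the full data
image(small)  # a much shorter matrix image
```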

itemFrequencyPlot(groceries, topN=25,  cex.names=1)

Here we can see the relative frequency (support) of the 25 most common items, in descending order.
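
The numbers behind the plot can also be pulled out directly with itemFrequency(). This is a small sketch, again assuming the `Groceries` dataset shipped with arules as a stand-in for `groceries`:

```r
library(arules)

# Assumption: the built-in Groceries data stands in for the CSV read above
data("Groceries")

# Absolute counts per item, sorted from most to least frequent
item_counts <- sort(itemFrequency(Groceries, type = "absolute"),
                    decreasing = TRUE)

# The 25 items that the plot displays
top25 <- head(item_counts, 25)
top25
```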

Next I will use the Apriori algorithm. With the default parameter values, apriori() generates no rules for these data, so I lowered the support threshold to 0.01 and then to 0.001. With a minimum confidence of 0.75, a support of 0.001 produced 777 rules. We will look at the top ten by lift.
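
The threshold search described above could be sketched as a small loop. This is an illustration, not the exact procedure I ran: it assumes the `Groceries` dataset shipped with arules as a stand-in for `groceries`, and it holds confidence at 0.75 for every run, whereas the arules defaults are a support of 0.1 and a confidence of 0.8.

```r
library(arules)

# Assumption: the built-in Groceries data stands in for the CSV read above
data("Groceries")

# Support thresholds from the text: the default (0.1), then 0.01 and 0.001
supports <- c(0.1, 0.01, 0.001)

# Count the rules mined at each threshold, with confidence fixed at 0.75
n_rules <- sapply(supports, function(s) {
  r <- apriori(Groceries,
               parameter = list(supp = s, conf = 0.75,
                                minlen = 2, maxlen = 10),
               control   = list(verbose = FALSE))
  length(r)
})

data.frame(support = supports, rules = n_rules)
```

Lowering the support threshold can only keep or add rules, so the counts never decrease down the table.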

rules <- apriori(groceries,
                 parameter = list(supp = 0.001, conf = 0.75,
                                  minlen = 2, maxlen = 10,
                                  target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [777 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(head(rules, n = 10, by = "lift"))
##      lhs                         rhs                   support confidence    coverage      lift count
## [1]  {liquor,                                                                                        
##       red/blush wine}         => {bottled beer}    0.001931876  0.9047619 0.002135231 11.235269    19
## [2]  {citrus fruit,                                                                                  
##       fruit/vegetable juice,                                                                         
##       other vegetables,                                                                              
##       soda}                   => {root vegetables} 0.001016777  0.9090909 0.001118454  8.340400    10
## [3]  {oil,                                                                                           
##       other vegetables,                                                                              
##       tropical fruit,                                                                                
##       whole milk,                                                                                    
##       yogurt}                 => {root vegetables} 0.001016777  0.9090909 0.001118454  8.340400    10
## [4]  {citrus fruit,                                                                                  
##       fruit/vegetable juice,                                                                         
##       grapes}                 => {tropical fruit}  0.001118454  0.8461538 0.001321810  8.063879    11
## [5]  {other vegetables,                                                                              
##       rice,                                                                                          
##       whole milk,                                                                                    
##       yogurt}                 => {root vegetables} 0.001321810  0.8666667 0.001525165  7.951182    13
## [6]  {oil,                                                                                           
##       other vegetables,                                                                              
##       tropical fruit,                                                                                
##       whole milk}             => {root vegetables} 0.001321810  0.8666667 0.001525165  7.951182    13
## [7]  {ham,                                                                                           
##       other vegetables,                                                                              
##       pip fruit,                                                                                     
##       yogurt}                 => {tropical fruit}  0.001016777  0.8333333 0.001220132  7.941699    10
## [8]  {beef,                                                                                          
##       citrus fruit,                                                                                  
##       other vegetables,                                                                              
##       tropical fruit}         => {root vegetables} 0.001016777  0.8333333 0.001220132  7.645367    10
## [9]  {fruit/vegetable juice,                                                                         
##       grapes,                                                                                        
##       other vegetables}       => {tropical fruit}  0.001118454  0.7857143 0.001423488  7.487888    11
## [10] {bottled water,                                                                                 
##       other vegetables,                                                                              
##       root vegetables,                                                                               
##       whole milk,                                                                                    
##       yogurt}                 => {tropical fruit}  0.001118454  0.7857143 0.001423488  7.487888    11
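
The columns in this table are linked by simple arithmetic. As a rough sketch in base R, using only the figures reported for rule [1] ({liquor, red/blush wine} => {bottled beer}) and variable names of my own:

```r
# Figures copied from rule [1] in the table above
n          <- 9835         # total transactions
count_both <- 19           # baskets containing lhs and rhs together
coverage   <- 0.002135231  # reported support of the lhs alone
lift       <- 11.235269    # reported lift

# Recover the number of baskets that contain the lhs
count_lhs <- round(coverage * n)

support    <- count_both / n          # share of baskets with lhs and rhs
confidence <- count_both / count_lhs  # share of lhs baskets that also hold the rhs

# lift = confidence / support(rhs), so the implied support of the rhs alone is:
support_rhs <- confidence / lift

c(count_lhs = count_lhs, support = support,
  confidence = confidence, support_rhs = support_rhs)
```

The support and confidence computed this way agree with the table, and the implied rhs support shows how lift rescales confidence by how common the rhs item is overall.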

Tropical fruit and root vegetables are the rhs of nine of the ten highest-lift rules. When filtering for confidence above 90%, yogurt also appears frequently on the rhs. The bottled beer rule has the highest lift (11.2) and a high confidence (0.90). Lift is helpful because it corrects for the popularity of the rhs item: yogurt’s appearance in many high-confidence, low-lift rules may simply reflect how popular it is.
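
The confidence filter mentioned above could be written along these lines. This is a sketch rather than the exact code I ran; it assumes the `Groceries` dataset shipped with arules as a stand-in for `groceries` and re-mines the rules so that the block runs on its own.

```r
library(arules)

# Assumption: the built-in Groceries data stands in for the CSV read above
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.75,
                                  minlen = 2, maxlen = 10),
                 control   = list(verbose = FALSE))

# Keep only the rules with confidence above 90%
high_conf <- subset(rules, subset = confidence > 0.9)

# Tabulate which items appear on the right-hand side, most frequent first
rhs_counts <- sort(table(labels(rhs(high_conf))), decreasing = TRUE)
rhs_counts
```

Sorting the same subset by lift, for example with head(high_conf, n = 10, by = "lift"), shows whether a frequent rhs item also carries high lift or is just popular.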

library(arulesViz)

subrules <- head(rules, n = 10, by = "lift")

plot(subrules, method = "graph",  engine = "htmlwidget")

This graph shows the associations between items as defined by our rules. Much as a cluster plot conveys the proximity of data points on a coordinate plane, the graph places related items and rules near one another. Selecting an item in the interactive widget highlights the rules it takes part in, which helps us see how each item or rule connects to its neighbors.