I’ll follow the guidelines of this page:
https://www.kirenz.com/post/2020-05-14-r-association-rule-mining/#association-rules
Rather than converting the CSV into a list of character vectors as the article does, I will read it directly with read.transactions() from arules.
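For comparison, the list-of-character-vectors route that the article describes can be sketched in base R. This is only a minimal sketch: the three-basket file is made up for illustration (it is not the grocery data), and `basket_file` and `baskets` are hypothetical names.

```r
# Hypothetical toy file in "basket" layout: one transaction per line,
# items separated by commas (NOT the real GroceryDataSet.csv).
basket_file <- tempfile(fileext = ".csv")
writeLines(c("whole milk,yogurt",
             "soda",
             "whole milk,rolls/buns,soda"),
           basket_file)

# Split each line on commas to get a list of character vectors,
# one vector per transaction.
baskets <- strsplit(readLines(basket_file), ",", fixed = TRUE)

length(baskets)                 # number of transactions
lengths(baskets)                # items per transaction
sort(unique(unlist(baskets)))   # distinct item labels
```

read.transactions() with sep = "," takes care of this parsing and builds the sparse transactions object in one step, which is why it is used in this analysis.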
library(arules)
# Each CSV row is one basket; sep = "," splits the row into individual items
groceries <- read.transactions("/Users/harris/ds_masters/DATA624/market_basket/GroceryDataSet.csv", sep = ",")
dim(groceries)
## [1] 9835 169
itemLabels(groceries)
## [1] "abrasive cleaner" "artif. sweetener"
## [3] "baby cosmetics" "baby food"
## [5] "bags" "baking powder"
## [7] "bathroom cleaner" "beef"
## [9] "berries" "beverages"
## [11] "bottled beer" "bottled water"
## [13] "brandy" "brown bread"
## [15] "butter" "butter milk"
## [17] "cake bar" "candles"
## [19] "candy" "canned beer"
## [21] "canned fish" "canned fruit"
## [23] "canned vegetables" "cat food"
## [25] "cereals" "chewing gum"
## [27] "chicken" "chocolate"
## [29] "chocolate marshmallow" "citrus fruit"
## [31] "cleaner" "cling film/bags"
## [33] "cocoa drinks" "coffee"
## [35] "condensed milk" "cooking chocolate"
## [37] "cookware" "cream"
## [39] "cream cheese" "curd"
## [41] "curd cheese" "decalcifier"
## [43] "dental care" "dessert"
## [45] "detergent" "dish cleaner"
## [47] "dishes" "dog food"
## [49] "domestic eggs" "female sanitary products"
## [51] "finished products" "fish"
## [53] "flour" "flower (seeds)"
## [55] "flower soil/fertilizer" "frankfurter"
## [57] "frozen chicken" "frozen dessert"
## [59] "frozen fish" "frozen fruits"
## [61] "frozen meals" "frozen potato products"
## [63] "frozen vegetables" "fruit/vegetable juice"
## [65] "grapes" "hair spray"
## [67] "ham" "hamburger meat"
## [69] "hard cheese" "herbs"
## [71] "honey" "house keeping products"
## [73] "hygiene articles" "ice cream"
## [75] "instant coffee" "Instant food products"
## [77] "jam" "ketchup"
## [79] "kitchen towels" "kitchen utensil"
## [81] "light bulbs" "liqueur"
## [83] "liquor" "liquor (appetizer)"
## [85] "liver loaf" "long life bakery product"
## [87] "make up remover" "male cosmetics"
## [89] "margarine" "mayonnaise"
## [91] "meat" "meat spreads"
## [93] "misc. beverages" "mustard"
## [95] "napkins" "newspapers"
## [97] "nut snack" "nuts/prunes"
## [99] "oil" "onions"
## [101] "organic products" "organic sausage"
## [103] "other vegetables" "packaged fruit/vegetables"
## [105] "pasta" "pastry"
## [107] "pet care" "photo/film"
## [109] "pickled vegetables" "pip fruit"
## [111] "popcorn" "pork"
## [113] "pot plants" "potato products"
## [115] "preservation products" "processed cheese"
## [117] "prosecco" "pudding powder"
## [119] "ready soups" "red/blush wine"
## [121] "rice" "roll products"
## [123] "rolls/buns" "root vegetables"
## [125] "rubbing alcohol" "rum"
## [127] "salad dressing" "salt"
## [129] "salty snack" "sauces"
## [131] "sausage" "seasonal products"
## [133] "semi-finished bread" "shopping bags"
## [135] "skin care" "sliced cheese"
## [137] "snack products" "soap"
## [139] "soda" "soft cheese"
## [141] "softener" "sound storage medium"
## [143] "soups" "sparkling wine"
## [145] "specialty bar" "specialty cheese"
## [147] "specialty chocolate" "specialty fat"
## [149] "specialty vegetables" "spices"
## [151] "spread cheese" "sugar"
## [153] "sweet spreads" "syrup"
## [155] "tea" "tidbits"
## [157] "toilet cleaner" "tropical fruit"
## [159] "turkey" "UHT-milk"
## [161] "vinegar" "waffles"
## [163] "whipped/sour cream" "whisky"
## [165] "white bread" "white wine"
## [167] "whole milk" "yogurt"
## [169] "zwieback"
summary(groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
image(groceries)
The itemLabels() and summary() functions show the unique item names along with item frequencies, row and column counts, and the distribution of transaction lengths.
These data don’t translate well to arules’ image() function: with 9,835 transactions against only 169 items, the resulting matrix image is far too tall and narrow to read.
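The density figure in the summary can be reproduced by hand, which also shows how sparse the matrix is. The sketch below uses only numbers copied from the summary(groceries) output; the variable names are my own.

```r
# Figures copied from the summary(groceries) output.
n_transactions <- 9835
n_items        <- 169
item_counts    <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top five plus "(Other)"

# Density = filled cells / all cells of the transaction-by-item matrix.
filled_cells <- sum(item_counts)
density      <- filled_cells / (n_transactions * n_items)

filled_cells                    # total item occurrences across all baskets
round(density, 8)               # compare with the reported density
filled_cells / n_transactions   # mean basket size, compare with the reported Mean
```

Only about 2.6% of the cells are filled, so even a readable image would be mostly empty space.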
itemFrequencyPlot(groceries, topN=25, cex.names=1)
Here we can see the relative frequency of the 25 most common items, in descending order.
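By default itemFrequencyPlot() plots relative frequencies, meaning the share of transactions that contain each item. As a rough check, the heights of the first five bars can be worked out from the counts in the summary output; `top_counts` and `rel_freq` are illustrative names.

```r
# Counts for the five most frequent items, copied from summary(groceries).
top_counts <- c("whole milk" = 2513, "other vegetables" = 1903,
                "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)
n_transactions <- 9835

# Relative frequency = share of transactions containing the item.
rel_freq <- top_counts / n_transactions
round(rel_freq, 3)
```

Whole milk, the most common item, appears in roughly a quarter of all baskets.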
Next I will use the Apriori algorithm. Having already tried it, I know the default parameter values don’t generate any rules, so I lowered the minimum support to 0.01 and then to 0.001. With a confidence cut-off of 0.75, the latter produced 777 rules. We will look at the top ten by lift.
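rules <- apriori(groceries,
                 parameter = list(supp = 0.001,    # minimum support
                                  conf = 0.75,     # minimum confidence
                                  maxlen = 10,     # at most 10 items per rule
                                  minlen = 2,      # at least one lhs item plus the rhs
                                  target = "rules"))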
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [777 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
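To see why the support threshold matters so much here, it helps to turn each threshold into a number of transactions. The sketch below assumes arules’ default support of 0.1 for the first value; the other two are the values tried in this analysis, and the names are illustrative.

```r
n_transactions <- 9835   # from the "set transactions" line of the log

# Number of transactions an itemset needs at each support level tried.
supports   <- c(default = 0.1, first_try = 0.01, final = 0.001)
min_counts <- supports * n_transactions
min_counts

# The log's "Absolute minimum support count" appears to be the final
# product truncated to a whole number.
floor(min_counts[["final"]])

# Support of a rule that occurs in exactly 10 transactions.
10 / n_transactions
```

At the default threshold an itemset has to appear in nearly a thousand baskets, which is a high bar when the median basket holds three items. At 0.001 a rule seen in only 10 baskets qualifies, so the rules near the threshold rest on very few observations.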
inspect(head(rules, n = 10, by = "lift"))
##      lhs                        rhs               support     confidence coverage         lift count
## [1]  {liquor,
##       red/blush wine}        => {bottled beer}    0.001931876 0.9047619  0.002135231 11.235269    19
## [2]  {citrus fruit,
##       fruit/vegetable juice,
##       other vegetables,
##       soda}                  => {root vegetables} 0.001016777 0.9090909  0.001118454  8.340400    10
## [3]  {oil,
##       other vegetables,
##       tropical fruit,
##       whole milk,
##       yogurt}                => {root vegetables} 0.001016777 0.9090909  0.001118454  8.340400    10
## [4]  {citrus fruit,
##       fruit/vegetable juice,
##       grapes}                => {tropical fruit}  0.001118454 0.8461538  0.001321810  8.063879    11
## [5]  {other vegetables,
##       rice,
##       whole milk,
##       yogurt}                => {root vegetables} 0.001321810 0.8666667  0.001525165  7.951182    13
## [6]  {oil,
##       other vegetables,
##       tropical fruit,
##       whole milk}            => {root vegetables} 0.001321810 0.8666667  0.001525165  7.951182    13
## [7]  {ham,
##       other vegetables,
##       pip fruit,
##       yogurt}                => {tropical fruit}  0.001016777 0.8333333  0.001220132  7.941699    10
## [8]  {beef,
##       citrus fruit,
##       other vegetables,
##       tropical fruit}        => {root vegetables} 0.001016777 0.8333333  0.001220132  7.645367    10
## [9]  {fruit/vegetable juice,
##       grapes,
##       other vegetables}      => {tropical fruit}  0.001118454 0.7857143  0.001423488  7.487888    11
## [10] {bottled water,
##       other vegetables,
##       root vegetables,
##       whole milk,
##       yogurt}                => {tropical fruit}  0.001118454 0.7857143  0.001423488  7.487888    11
Tropical fruit and root vegetables are the rhs for nine of the ten highest-lift rules. When filtering for confidence above 90%, yogurt also appears frequently on the rhs. The bottled beer rule has the highest lift (about 11.2) and a high confidence (about 0.90). Lift is helpful because it corrects for the popularity of the rhs item: it divides a rule’s confidence by the share of all transactions that contain the rhs. Yogurt’s appearance in many high-confidence, low-lift rules may therefore simply reflect its overall popularity.
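The quality measures for the bottled beer rule can be reproduced from the numbers in the inspect() output, which makes the relationship between them concrete. This is a worked sketch with my own variable names; the rhs support is backed out from the reported lift rather than read from the data.

```r
# Values for rule [1], {liquor, red/blush wine} => {bottled beer},
# copied from the inspect() output.
n_transactions <- 9835
count_rule     <- 19            # transactions containing lhs and rhs together
coverage       <- 0.002135231   # support of the lhs alone
lift_reported  <- 11.235269

support    <- count_rule / n_transactions   # share of baskets matching the whole rule
confidence <- support / coverage            # share of lhs baskets that also hold the rhs

# lift = confidence / support(rhs), so the rhs support can be backed out.
# This value is derived from the table, not taken from the data.
rhs_support <- confidence / lift_reported

round(support, 9)       # compare with the support column
round(confidence, 7)    # compare with the confidence column
round(rhs_support, 4)   # implied share of baskets containing bottled beer
```

Bottled beer turns up in roughly 8% of all baskets but in about 90% of baskets that contain both liquor and red/blush wine, and that gap is what the lift of about 11 expresses.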
library(arulesViz)
subrules <- head(rules, n = 10, by = "lift")
plot(subrules, method = "graph", engine = "htmlwidget")
This graph helps us see the associations between items as defined by our rules. Much as a cluster plot shows the proximity of data objects on a coordinate plane, this graph tends to place items close to the rules that connect them. Selecting an item highlights all of the rules it takes part in, which helps visualize how each item or rule relates to its neighbors.
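One way to picture what the graph encodes is as an edge list in which both items and rules are nodes. The sketch below writes out two of the top rules by hand. The layout of `rules_tbl` and `edges` is my own assumption for illustration and is not arulesViz’s internal representation.

```r
# Two of the top rules from the inspect() output, written out by hand.
rules_tbl <- list(
  list(id = "rule 1", lhs = c("liquor", "red/blush wine"),
       rhs = "bottled beer"),
  list(id = "rule 4", lhs = c("citrus fruit", "fruit/vegetable juice", "grapes"),
       rhs = "tropical fruit")
)

# One edge from each lhs item into the rule node,
# and one edge from the rule node to the rhs item.
edges <- do.call(rbind, lapply(rules_tbl, function(r) {
  data.frame(from = c(r$lhs, r$id),
             to   = c(rep(r$id, length(r$lhs)), r$rhs),
             stringsAsFactors = FALSE)
}))
edges
```

Items shared by several rules, such as other vegetables in the top ten, would collect many edges in a table like this, and those are the nodes that stand out when selected in the interactive graph.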