GitHub repo | portfolio | Blog
Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. A receipt is a snapshot of what went into a customer's basket, hence the term 'Market Basket Analysis'.
That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction, and each column in a row represents an item. The data set is attached.
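To make the layout concrete: each row of the raw CSV is just a comma-separated list of items, with shorter receipts padded by empty cells. Based on the first transaction inspected below, the first row looks roughly like:
citrus fruit,semi-finished bread,margarine,ready soups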
Your assignment is to use R to mine the data for association rules. You should report support, confidence, and lift, along with your top 10 rules sorted by lift.
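For reference, the three measures reported below have standard definitions: support(A => B) = P(A and B), the fraction of receipts containing every item in the rule; confidence(A => B) = supp(A => B) / supp(A), the conditional probability of the RHS given the LHS; and lift(A => B) = confidence(A => B) / supp(B), how much more likely the RHS is when the LHS is present than would be expected if the two were independent.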
# Libraries used throughout
library(dplyr)
library(stringr)
library(reshape2)
library(arules)
# Load Groceries Dataset; blank cells become NA
groceries_df <- read.csv("GroceryDataSet.csv", header = FALSE,
                         na.strings = "",
                         stringsAsFactors = FALSE)
# Add an id column so each receipt keeps its identity after reshaping
groceries_df$id <- seq_len(nrow(groceries_df))
# Trim stray whitespace from item names
groceries_df <- groceries_df %>%
  mutate(across(where(is.character), str_trim))
print(paste(nrow(groceries_df), ncol(groceries_df)))
## [1] "9835 33"
# Reshape from wide (one column per item slot) to long (one row per id/item pair)
groceries_long <- melt(groceries_df, id.vars = "id")
# Split items by receipt id, de-duplicate within each basket, and coerce the
# resulting list into an arules 'transactions' object
groceries_trans <- as(lapply(split(groceries_long$value, groceries_long$id), unique), "transactions")
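As an aside, arules can read basket-format CSVs directly with read.transactions(), which would replace the melt/split steps above; a one-line alternative (left commented so the pipeline above remains the one actually run):
# groceries_trans <- read.transactions("GroceryDataSet.csv", format = "basket",
#                                      sep = ",", rm.duplicates = TRUE)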
inspect(groceries_trans[1:5])
##     items                      transactionID
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread} 1
## [2] {coffee,
## tropical fruit,
## yogurt} 2
## [3] {whole milk} 3
## [4] {cream cheese,
## meat spreads,
## pip fruit,
## yogurt} 4
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk} 5
itemFrequencyPlot(groceries_trans,
                  topN = 30,
                  type = "absolute",
                  main = "Absolute Item Frequency Plot",
                  ylab = "Item Frequency (Absolute)")
I am filtering to include only rules with one to four items on the LHS (apriori mines rules with a single-item RHS, so this caps rules at five items in total). Longer rules are probably less useful and would increase noise. Given the large number of receipts, I am using a low support threshold (supp = 0.001, roughly 10 of the 9,835 receipts), and I experimented with different confidence values to get what look like reasonable rules.
Note: I tried to remove redundant rules using examples from the documentation, but the filtering treated ALL rules as redundant. After a fair bit of troubleshooting, I removed that step for now (a possible fix using arules' built-in is.redundant() is sketched below the commented-out attempt). This is certainly something I would circle back to if I were using this analysis for real.
rules <- apriori(groceries_trans,
                 parameter = list(supp = 0.001,
                                  conf = 0.15,
                                  minlen = 2,
                                  maxlen = 5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.15 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5
## Warning in apriori(groceries_trans, parameter = list(supp = 0.001, conf =
## 0.15, : Mining stopped (maxlen reached). Only patterns up to a length of 5
## returned!
## done [0.01s].
## writing ... [26760 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
# find redundant rules
# subset.matrix <- is.subset(rules, rules)
# subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
# redundant <- colSums(subset.matrix, na.rm=T) >= 1
# remove redundant rules
# rules.pruned <- rules[!redundant]
# rules <- rules.pruned
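# A possible fix (sketched but not run, so the rule counts below are unchanged):
# arules provides is.redundant(), which flags a rule as redundant when a more
# general rule with the same RHS has equal or higher confidence.
# rules <- rules[!is.redundant(rules)]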
rules <- sort(rules, by="lift", decreasing=TRUE)
inspect(rules[1:10])
##      lhs                        rhs                     support confidence coverage lift count
## [1] {bottled beer,
## red/blush wine} => {liquor} 0.0019 0.40 0.0049 36 19
## [2] {hamburger meat,
## soda} => {Instant food products} 0.0012 0.21 0.0058 26 12
## [3] {ham,
## white bread} => {processed cheese} 0.0019 0.38 0.0051 23 19
## [4] {other vegetables,
## root vegetables,
## whole milk,
## yogurt} => {rice} 0.0013 0.17 0.0078 22 13
## [5] {bottled beer,
## liquor} => {red/blush wine} 0.0019 0.41 0.0047 21 19
## [6] {Instant food products,
## soda} => {hamburger meat} 0.0012 0.63 0.0019 19 12
## [7] {curd,
## sugar} => {flour} 0.0011 0.32 0.0035 19 11
## [8] {baking powder,
## sugar} => {flour} 0.0010 0.31 0.0033 18 10
## [9] {processed cheese,
## white bread} => {ham} 0.0019 0.46 0.0042 18 19
## [10] {fruit/vegetable juice,
## ham} => {processed cheese} 0.0011 0.29 0.0039 17 11
Stepping back, if this were being used in a real setting, I would explore generating a score from the geometric mean of support, confidence, and lift, then sorting my rules by that score. Ideally we want rules that are both prevalent and meaningful: a large lift with low support would not be actionable, and high confidence with low support or low lift would not be very actionable either. Ideally, we want rules with both high lift and high support.
Per the instructions, I have sorted by lift for now; however, lift alone probably isn't the best approach.
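As a sketch of what that could look like using arules' quality() accessor (the score column and its name are my own addition, not run as part of this analysis):
# Composite 'score': geometric mean of support, confidence, and lift
q <- quality(rules)
quality(rules)$score <- (q$support * q$confidence * q$lift)^(1/3)
rules_by_score <- sort(rules, by = "score", decreasing = TRUE)
inspect(rules_by_score[1:10])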
To see which items and rules hang together, I also cluster them and plot dendrograms.
# Cluster the common items (> 2% support) using affinity-based dissimilarity
s <- groceries_trans[, itemFrequency(groceries_trans) > 0.02]
d_jaccard <- dissimilarity(s, which = "items", method = "affinity")
plot(hclust(d_jaccard, method = "ward.D2"), main = "Dendrogram for Items")
# Cluster the top 20 rules the same way
d_affinity <- dissimilarity(rules[1:20],
                            method = "affinity",
                            args = list(transactions = groceries_trans))
hc <- hclust(d_affinity, method = "ward.D2")
plot(hc, main = "Dendrogram for Rules (Affinity)")
## set of 26760 rules
## lhs rhs support confidence coverage lift count
## [1] {hamburger meat,
## soda} => {Instant food products} 0.0012 0.21 0.0058 26 12
## [2] {other vegetables,
## root vegetables,
## whole milk,
## yogurt} => {rice} 0.0013 0.17 0.0078 22 13
## [3] {Instant food products,
## soda} => {hamburger meat} 0.0012 0.63 0.0019 19 12
## [4] {curd,
## sugar} => {flour} 0.0011 0.32 0.0035 19 11
## [5] {baking powder,
## sugar} => {flour} 0.0010 0.31 0.0033 18 10
## [6] {margarine,
## sugar} => {flour} 0.0016 0.30 0.0055 17 16
## [7] {root vegetables,
## sugar,
## whole milk} => {flour} 0.0010 0.29 0.0035 17 10
## [8] {popcorn,
## soda} => {salty snack} 0.0012 0.63 0.0019 17 12
## [9] {baking powder,
## flour} => {sugar} 0.0010 0.56 0.0018 16 10
## [10] {sugar,
## whipped/sour cream} => {baking powder} 0.0013 0.27 0.0049 15 13