Github repo | portfolio | Blog
Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached.
Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
# Load Groceries Dataset
groceries_df <- read.csv("GroceryDataSet.csv", header = FALSE,
na.strings="",
stringsAsFactors=FALSE )
# Add an id column
groceries_df$id <- seq(nrow(groceries_df))
groceries_df <- groceries_df %>%
mutate(across(where(is.character), str_trim))
print(paste(nrow(groceries_df), ncol(groceries_df)))
## [1] "9835 33"
groceries_long <- melt(groceries_df, id.vars="id")
groceries_trans <- as(lapply(split(groceries_long$value, groceries_long$id), unique), "transactions")
inspect(groceries_trans[1:5])
## items transactionID
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread} 1
## [2] {coffee,
## tropical fruit,
## yogurt} 2
## [3] {whole milk} 3
## [4] {cream cheese,
## meat spreads,
## pip fruit,
## yogurt} 4
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk} 5
itemFrequencyPlot(groceries_trans,
topN=30,
type="absolute",
main='Absolute Item Frequency Plot',
ylab="Item Frequency (Absolute)")
I am filtering to only include rules with 1~4 items on the RHS. Rules with 5+ items are probably less useful and would increase noise. Given the large number of receipts, I’m using a lower support
threshold and I experimented with different confidence
values to get what look like reasonable rules.
Note: I tried to remove redundant rules using examples from documentation, but for some reason (I don’t know why), the filtering treated ALL rules as redundant. After a fair bit of troubleshooting, I just removed that step for now. This would certainly be something I would circle back to if I were using this analysis for real.
rules <- apriori(groceries_trans,
parameter = list(supp = 0.001,
conf = 0.15,
minlen = 2,
maxlen = 5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.15 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5
## Warning in apriori(groceries_trans, parameter = list(supp = 0.001, conf =
## 0.15, : Mining stopped (maxlen reached). Only patterns up to a length of 5
## returned!
## done [0.01s].
## writing ... [26760 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
# find redundant rules
# subset.matrix <- is.subset(rules, rules)
# subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
# redundant <- colSums(subset.matrix, na.rm=T) >= 1
# remove redundant rules
# rules.pruned <- rules[!redundant]
# rules <- rules.pruned
rules <- sort(rules, by="lift", decreasing=TRUE)
inspect(rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {bottled beer,
## red/blush wine} => {liquor} 0.0019 0.40 0.0049 36 19
## [2] {hamburger meat,
## soda} => {Instant food products} 0.0012 0.21 0.0058 26 12
## [3] {ham,
## white bread} => {processed cheese} 0.0019 0.38 0.0051 23 19
## [4] {other vegetables,
## root vegetables,
## whole milk,
## yogurt} => {rice} 0.0013 0.17 0.0078 22 13
## [5] {bottled beer,
## liquor} => {red/blush wine} 0.0019 0.41 0.0047 21 19
## [6] {Instant food products,
## soda} => {hamburger meat} 0.0012 0.63 0.0019 19 12
## [7] {curd,
## sugar} => {flour} 0.0011 0.32 0.0035 19 11
## [8] {baking powder,
## sugar} => {flour} 0.0010 0.31 0.0033 18 10
## [9] {processed cheese,
## white bread} => {ham} 0.0019 0.46 0.0042 18 19
## [10] {fruit/vegetable juice,
## ham} => {processed cheese} 0.0011 0.29 0.0039 17 11
Stepping back, if this were being used in a real setting, I would explore generating a score
using a geometric mean of support
, confidence
and lift
… then sorting my rules based on this new metric, score
. Ideally we want rules that are both prevalent and meaningful. A large lift
with low support
not be actionable. Alternatively, a high confidence
with low support
or low lift
would also not be as actionable. Ideally, we want rules that identify both high lift
and high support
.
Per instructions, I’ve just sorted based on lift
for now; however, lift
alone probably isn’t the best approach.
s <- groceries_trans[,itemFrequency(groceries_trans) > 0.02]
d_jaccard <- dissimilarity(s, which = "items", method="affinity")
plot(hclust(d_jaccard, method = "ward.D2"), main = "Dendrogram for Items")
d_affinity <- dissimilarity(rules[1:20],
method = "affinity",
args = list(transactions = groceries_trans))
hc <- hclust(d_affinity, method = "ward.D2")
plot(hc, main = "Dendrogram for Rules (Affinity)")
## set of 26760 rules
## lhs rhs support confidence coverage lift count
## [1] {hamburger meat,
## soda} => {Instant food products} 0.0012 0.21 0.0058 26 12
## [2] {other vegetables,
## root vegetables,
## whole milk,
## yogurt} => {rice} 0.0013 0.17 0.0078 22 13
## [3] {Instant food products,
## soda} => {hamburger meat} 0.0012 0.63 0.0019 19 12
## [4] {curd,
## sugar} => {flour} 0.0011 0.32 0.0035 19 11
## [5] {baking powder,
## sugar} => {flour} 0.0010 0.31 0.0033 18 10
## [6] {margarine,
## sugar} => {flour} 0.0016 0.30 0.0055 17 16
## [7] {root vegetables,
## sugar,
## whole milk} => {flour} 0.0010 0.29 0.0035 17 10
## [8] {popcorn,
## soda} => {salty snack} 0.0012 0.63 0.0019 17 12
## [9] {baking powder,
## flour} => {sugar} 0.0010 0.56 0.0018 16 10
## [10] {sugar,
## whipped/sour cream} => {baking powder} 0.0013 0.27 0.0049 15 13