Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction and lists the items that were purchased. A receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased on it. Each line is called a transaction, and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift, and your top 10 rules by lift.
library(arules)
library(arulesViz)
library(cluster)
library(factoextra)
# Reading the CSV in basket format: one transaction per line, items separated by commas
transaction <- read.transactions('https://raw.githubusercontent.com/Jlok17/2022MSDS/main/Source/Data%20624/GroceryDataSet.csv', sep = ',', header = FALSE)
head(transaction)
## transactions in sparse format with
## 6 transactions (rows) and
## 169 items (columns)
str(transaction)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## .. .. ..@ i : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
## .. .. ..@ p : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
## .. .. ..@ Dim : int [1:2] 169 9835
## .. .. ..@ Dimnames:List of 2
## .. .. .. ..$ : NULL
## .. .. .. ..$ : NULL
## .. .. ..@ factors : list()
## ..@ itemInfo :'data.frame': 169 obs. of 1 variable:
## .. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
## ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
To create the rules, I set minimum thresholds for support and confidence. I use a minimum support of 0.002 because I want each item set to appear in at least 0.2% of all transactions (about 20 of the 9,835 receipts). This filters out very rare item combinations, which rest on too few transactions to be reliable, while still keeping the less common combinations. I use a minimum confidence of 0.5 so that the consequent appears in at least 50% of the transactions that contain the antecedent, a threshold that is neither overly cautious nor too lenient.
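As a minimal sketch of what these two thresholds measure, the base R snippet below computes support and confidence by hand for one candidate rule. The `baskets` list and the `support()` helper are hypothetical and exist only for illustration; they are not part of arules or of the Groceries file.

```r
# Toy baskets (hypothetical data, not taken from the Groceries file)
baskets <- list(
  c("butter", "hard cheese", "whipped/sour cream"),
  c("butter", "hard cheese"),
  c("butter", "whole milk"),
  c("whipped/sour cream", "whole milk"),
  c("whole milk")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("butter", "hard cheese")   # antecedent
rhs <- "whipped/sour cream"         # consequent

# Support of the rule: share of all baskets holding lhs and rhs together
supp_rule <- support(c(lhs, rhs), baskets)

# Confidence of the rule: share of the lhs baskets that also hold rhs
conf_rule <- supp_rule / support(lhs, baskets)

# A rule is kept only if it clears both minimum thresholds
min_supp <- 0.002
min_conf <- 0.5
keep <- supp_rule >= min_supp && conf_rule >= min_conf

# Smallest whole number of receipts that reaches 0.2% of 9,835 transactions
min_count <- ceiling(min_supp * 9835)
```

On the toy data the rule appears in one of five baskets and in one of the two baskets that contain the antecedent, so it just reaches the 0.5 confidence cutoff.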
# Generating rules
rules <- apriori(transaction, parameter = list(supp = 0.002, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.002 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [147 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.01s].
## writing ... [1098 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Top 10 Sorted by lift
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_sorted, 10))
## lhs rhs support confidence coverage lift count
## [1] {butter,
## hard cheese} => {whipped/sour cream} 0.002033554 0.5128205 0.003965430 7.154028 20
## [2] {beef,
## citrus fruit,
## other vegetables} => {root vegetables} 0.002135231 0.6363636 0.003355363 5.838280 21
## [3] {citrus fruit,
## other vegetables,
## tropical fruit,
## whole milk} => {root vegetables} 0.003152008 0.6326531 0.004982206 5.804238 31
## [4] {citrus fruit,
## frozen vegetables,
## other vegetables} => {root vegetables} 0.002033554 0.6250000 0.003253686 5.734025 20
## [5] {beef,
## other vegetables,
## tropical fruit} => {root vegetables} 0.002745297 0.6136364 0.004473818 5.629770 27
## [6] {bottled water,
## root vegetables,
## yogurt} => {tropical fruit} 0.002236909 0.5789474 0.003863752 5.517391 22
## [7] {herbs,
## other vegetables,
## whole milk} => {root vegetables} 0.002440264 0.6000000 0.004067107 5.504664 24
## [8] {grapes,
## pip fruit} => {tropical fruit} 0.002135231 0.5675676 0.003762074 5.408941 21
## [9] {herbs,
## yogurt} => {root vegetables} 0.002033554 0.5714286 0.003558719 5.242537 20
## [10] {beef,
## other vegetables,
## soda} => {root vegetables} 0.002033554 0.5714286 0.003558719 5.242537 20
# Scatter plot of all rules (support vs. confidence, shaded by lift)
plot(rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
# Item frequency plot for the 20 most frequent items
itemFrequencyPlot(transaction, topN = 20)
Briefly, lift measures the strength of the association in a rule. A lift greater than 1 shows that the antecedent and consequent occur together more often than they would if they were independent, which indicates a positive relationship between the two. A lift equal to 1 suggests that the antecedent and consequent are independent of each other. A lift less than 1 means that they occur together less often than expected, which may indicate that the items are substitutes. For Rule 1, {butter, hard cheese} => {whipped/sour cream} has a lift of 7.154. This means that customers who buy butter and hard cheese are about 7.15 times as likely to buy whipped/sour cream as customers in general. This information can be used to design marketing strategies such as coupons and promotions, or even to plan a store layout in which these items are placed far apart, so that customers walk through more of the store, which could generate more revenue.
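To make the lift figure for Rule 1 concrete, here is a small arithmetic check in base R. It only reuses numbers printed in the inspect() output above and relies on the standard identity lift = confidence / support(rhs); the variable names are chosen for illustration.

```r
# Values copied from rule [1] in the inspect() output
n          <- 9835          # total number of transactions
count_rule <- 20            # receipts with butter, hard cheese and whipped/sour cream
coverage   <- 0.003965430   # support of the antecedent {butter, hard cheese}
confidence <- 0.5128205
lift       <- 7.154028

# lift = confidence / support(rhs), so the baseline share of receipts
# containing whipped/sour cream is about 0.0717
supp_rhs <- confidence / lift

# Number of receipts expected to hold antecedent and consequent together
# if the two were independent (about 2.8)
expected <- n * coverage * supp_rhs

# Observed count divided by the expected count recovers the lift (about 7.15)
observed_over_expected <- count_rule / expected

round(c(supp_rhs = supp_rhs, expected = expected, ratio = observed_over_expected), 4)
```

In other words, roughly 3 receipts would be expected to contain all three items by chance, while 20 actually do.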
I also tried hierarchical clustering (plotted with factoextra), but it didn’t seem to yield any good results. This will continue to be a work in progress for me.
# Hierarchical Clustering
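# Converting transactions to a sparse binary matrix (rows = items, columns = transactions)
itemMatrix <- as(transaction, "ngCMatrix")
# Transpose the itemMatrix (rows = transactions, columns = items)
transposedMatrix <- t(itemMatrix)
# Computing Euclidean distance between rows; after the transpose the rows are
# transactions, so this clusters the 9,835 receipts (use dist(itemMatrix) to cluster items)
dist_items <- dist(transposedMatrix, method = "euclidean")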
hclust_items <- hclust(dist_items)
plot(hclust_items)
# Cutting Tree to Clusters
clusters <- cutree(hclust_items, k = 5)
fviz_dend(hclust_items, k = 5, rect = TRUE, cex = 0.5)
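One possible direction for the clustering, offered as a sketch and not as a tested improvement, is to cluster the items themselves using a dissimilarity designed for binary data. The example below uses `dissimilarity()` from arules with the Jaccard method. It makes two assumptions: the `Groceries` data bundled with arules stands in for `transaction` so the snippet is self-contained, and the 5% frequency cutoff and average linkage are arbitrary illustrative choices.

```r
library(arules)

# Assumption: the Groceries data shipped with arules stands in for `transaction`
data("Groceries", package = "arules")

# Keep only items bought in more than 5% of transactions so the dendrogram
# stays readable (the cutoff is an arbitrary choice for illustration)
frequent <- Groceries[, itemFrequency(Groceries) > 0.05]

# Jaccard dissimilarity between items; which = "items" compares columns
# (items) instead of rows (transactions)
d_items <- dissimilarity(frequent, method = "jaccard", which = "items")

# Hierarchical clustering of the items and a cut into 5 groups
hc_items <- hclust(d_items, method = "average")
plot(hc_items)
groups <- cutree(hc_items, k = 5)
```

With the document's own object, `Groceries` could be replaced by `transaction`. Whether the resulting groups are more interpretable than the Euclidean version would still need to be checked.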